Lecturer: Professor Nike Sun
Notes by: Andrew Lin 


Fall 2019 


Introduction 


The course website for this class can be found at the URL in [10]. That page notably contains a summary of topics 
covered, as well as some additional references and brief lecture notes from other similar classes. 

18.675 is the new numbering for 18.175; the material covered will be similar to previous years. The primary goal of 
the class is to cover basic objects and laws in probability in a formal mathematical framework, so some background 
in analysis and probability is strongly recommended. (In particular, if we are missing 18.100, we should talk to Professor 
Sun after class.) The main readings for this class will come from chapters 1-4 of our textbook [4], and readings for 
each chapter are included in the above summary of topics. There is some overlap with 18.125 (measure theory), 
particularly near the beginning, but most of the content will be significantly different. 

Grades are calculated mostly from homework (25%), exam 1 (35%), and exam 2 (35%). Class participation is the 
remaining 5% of the grade; while this is a graduate class, we are expected to attend. This is a large class that will 
probably get smaller over time, so attendance may be just checked manually. Alternatively, a short quiz may be given 
occasionally in class which won't be graded and is mostly to check whether we're all following the material (and also 
to have a formal record of who showed up). Realistically, attending class is highly correlated with doing well, so we
should definitely do so!


Remark 1. Notes for this class have been edited to include some details not covered in class.

We'll begin the class with some measure theory, which gives us a formal language for probability. On its own, measure
theory is a bit dry, so we'll try to motivate why it's an important thing to study, starting with some elementary examples
of probability and random variables. We should be familiar with these concepts from 18.600 or some other similar course:


- Consider a $p$-biased coin, where each flip comes up heads with probability $p$ and tails with probability $1 - p$.
Then the probability that six independent $p$-biased coin flips come up (H, H, T, H, T, H) in that order is just the
product of the probabilities, which is $p^4 (1 - p)^2$. And since there are $\binom{6}{4}$ ways to pick 4 heads among 6 coin
tosses, the probability of having exactly four heads when we flip six $p$-coins is $\binom{6}{4} p^4 (1 - p)^2$.


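- Suppose $U_n$ is a random variable distributed uniformly over the finite set $\{\frac{1}{n}, \frac{2}{n}, \dots, \frac{n-1}{n}, 1\}$, and $V_n$ is an
independent copy of $U_n$. Then for any $1 \le i, j \le n$, we have statements of independence like

$$P\left(U_n = \tfrac{i}{n},\ V_n = \tfrac{j}{n}\right) = P\left(U_n = \tfrac{i}{n}\right) P\left(V_n = \tfrac{j}{n}\right) = \frac{1}{n^2}.$$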
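As a quick sanity check of the coin-flip computation above, the following sketch compares the closed-form binomial probability against a brute-force sum over all $2^6$ flip sequences. The function names and the bias value 0.3 are illustrative assumptions, not part of the lecture.

```python
from itertools import product
from math import comb, isclose


def prob_of_sequence(seq, p):
    """Probability of one specific heads/tails sequence of independent p-coin flips."""
    result = 1.0
    for flip in seq:
        result *= p if flip == "H" else (1 - p)
    return result


def prob_k_heads(n, k, p):
    """Closed form: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_k_heads_brute_force(n, k, p):
    """Sum the probabilities of every length-n sequence with exactly k heads."""
    return sum(
        prob_of_sequence(seq, p)
        for seq in product("HT", repeat=n)
        if seq.count("H") == k
    )


p = 0.3  # hypothetical bias, chosen only for illustration
print(prob_of_sequence("HHTHTH", p))  # p^4 (1 - p)^2
print(prob_k_heads(6, 4, p), prob_k_heads_brute_force(6, 4, p))
```

The two printed values on the last line should agree up to floating-point rounding, which is exactly the statement that the $\binom{6}{4}$ orderings each contribute $p^4(1-p)^2$.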


But trying to compute probabilities like this becomes more complicated if the set of allowed states is infinite. For
example, we might want to write down a formal definition for a random variable $U$ which is uniform on $[0, 1]$, but we
can't have a positive probability that $U$ is equal to any particular value. And we might also want to ask if $U_n$ converges
to $U$ as $n \to \infty$ in some sense (since we seem to be populating $[0, 1]$ uniformly).
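One possible way to make "populating $[0, 1]$ uniformly" concrete is to compare $P(U_n \le x)$ with $x$, the value we would expect for a uniform $U$. This choice of comparison is an assumption made for illustration (the notes have not yet fixed a notion of convergence), and the helper names below are hypothetical.

```python
from fractions import Fraction
from math import floor


def cdf_discrete_uniform(n, x):
    """P(U_n <= x) for U_n uniform on {1/n, 2/n, ..., 1} and 0 <= x <= 1."""
    return Fraction(floor(n * x), n)


def max_cdf_gap(n, grid=1000):
    """Largest |P(U_n <= x) - x| over the grid points x = k / grid."""
    points = [Fraction(k, grid) for k in range(grid + 1)]
    return max(abs(cdf_discrete_uniform(n, x) - x) for x in points)


for n in (2, 10, 100):
    # Each individual value of U_n has probability 1/n, which shrinks to 0.
    print(n, float(max_cdf_gap(n)), "point mass:", Fraction(1, n))
```

Since $x - \lfloor nx \rfloor / n$ always lies in $[0, 1/n)$, the gap is below $1/n$ for every $n$, while the probability of any single value also shrinks to 0.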


We'll now see an example that illustrates that this kind of problem is trickier than it might initially seem:


Problem 2

Alice and Bob play a game with an infinite sequence of boxes. Alice puts a real number in each box, and Bob can
reveal the number she put in every box except for one (which he can choose). "Show" that Bob can then guess
the unseen number with 90 percent probability.


"Solution". Let $S$ be the set of infinite sequences of real numbers, which can also be denoted $\mathbb{R}^{\mathbb{N}}$. For two sequences
$x, y \in S$, say that $x \sim y$ (in other words, $x$ and $y$ are equivalent) if they eventually agree, so $x_n = y_n$ for all $n \ge N$
for some finite index $N$. This is an equivalence relation on $S$, so there is a quotient space $S/\sim$ of equivalence classes.
Pick one representative $z$ from each equivalence class, so that for any sequence $x \in S$, there is a unique representative
sequence $z = z(x)$ such that $z$ is equivalent to $x$. And since $z$ and $x$ are equivalent by definition, we can let $n(x)$ be
the first index for which $x$ and $z$ agree for all indices $n(x)$ and onward.

We can now describe Bob's strategy for guessing the unseen number. He splits the boxes into ten rows, where row
$k$ contains the boxes originally in spots congruent to $k$ mod 10. We now have ten real-valued sequences $x^{(1)}, \dots, x^{(10)}$
corresponding to our ten rows. Bob picks one of these rows uniformly at random (say row 10 without loss of
generality). He then reveals the numbers in all nine other rows and computes

$$N = \max\{n(x^{(1)}), n(x^{(2)}), \dots, n(x^{(9)})\} + 1.$$

(In other words, he looks at the first nine rows, and he finds the first index where each row agrees with its representative
sequence. Then he finds the place $N$ after which all rows agree.) If Bob then reveals every box in row 10 except for
box $N$, he has enough information to deduce $z^{(10)}$, the representative sequence for $x^{(10)}$. Bob can then guess that the
last box contains the representative sequence's $N$th element, $z^{(10)}_N$.

But $x^{(10)}_N$ only disagrees with $z^{(10)}_N$ if the value of $n(x^{(10)})$ was the largest out of all of the $n(x^{(i)})$s. Because the
row Bob chose was selected uniformly at random, and there's at most a 10 percent chance that he got the largest of
the $n(x^{(i)})$s, his chance of guessing the correct number is at least 90 percent.


This is a strange example: we can clearly get the probability to be arbitrarily close to 1 by replacing 10 with a
larger number, so we've violated some laws of probability to get to this point. But we've constructed the problem in
a way such that we don't really know where we've messed up! That's why we'll start with the axioms and use measure
theory to help us understand more clearly where we stand (and the issue will end up having to do with measurability).


Example 3

We'll start with a seemingly basic problem, defining the uniform random variable $U \sim \mathrm{Unif}[0, 1]$ from above.

One way to formalize this random variable is to construct a function $\mu$ (which we call a measure) on subsets of
$[0, 1]$, such that for any subset $A$, $\mu(A)$ is the probability that $U \in A$. For example, for any $0 \le a \le b \le 1$, it's natural
to want

$$\mu([a, b]) = b - a$$

if we require $U$ to be uniform. We also want to have some form of additivity, meaning that if $A, B \subseteq [0, 1]$ are disjoint,
then their disjoint union $A \sqcup B$ should have measure

$$\mu(A \sqcup B) = \mu(A) + \mu(B).$$

By induction, this means that for any positive integer $n$ and disjoint sets $A_1, \dots, A_n$,

$$\mu\left(\bigsqcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \mu(A_i).$$

Generalizing this further, if we take the limiting disjoint union $\bigsqcup_{i=1}^{n} A_i \to \bigsqcup_{i=1}^{\infty} A_i$, it's also natural that we may want
countable additivity

$$\mu\left(\bigsqcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i).$$

Remark 4. On the other hand, note that $\mu([0, 1]) = 1$, but $[0, 1]$ is an uncountable union of points, and each point
has measure $\mu(\{p\}) = \mu([p, p]) = 0$. So we can't ask our measure to have uncountable additivity, because 1 is not
the sum of a bunch of zeros.


So to summarize, our measure $\mu(A) = P(U \in A)$ (which is meant to define the uniform random variable $U$) should
have the following properties:

- It should respect lengths of intervals, meaning that $\mu([a, b]) = b - a$ for all $0 \le a \le b \le 1$.
- It should satisfy countable additivity (as defined above).
- It should also be translation-invariant, meaning that if $A \subseteq [0, 1]$, and we define $A_x = A + x$ to be $A$ shifted
rightward by $x$ (mod 1), then $\mu(A) = \mu(A_x)$.

These properties may not seem like much, so we might ask if this is enough to characterize $U$. In other words, if
we're given any subset $A$, we want to use these three properties to determine $\mu(A)$. Let's try a slightly complicated
example to see this in action:


Example 5

Consider the Cantor set $C = \bigcap_{n=0}^{\infty} C_n$, where the $C_n$ are iteratively defined as

$$C_0 = [0, 1], \qquad C_{n+1} = C_n \setminus (\text{"middle third" of each interval of } C_n).$$

(Because the $C_n$s are nested, their intersection is indeed well-defined.) To find the measure of $C$, we know that
$\mu(C_0) = 1$, and we also know that $C_n$ is the disjoint union of $C_{n+1}$ and the middle thirds of the intervals of $C_n$, so

$$\mu(C_{n+1}) = \mu(C_n) - \mu(\text{middle thirds of } C_n) = \frac{2}{3} \mu(C_n).$$

This means that $\mu(C_n) = \left(\frac{2}{3}\right)^n$ for all $n$. Since $C \subseteq C_n$ for every $n$, additivity gives $\mu(C) \le \mu(C_n)$, and taking
the limit, the measure of the Cantor set is $\mu(C) = 0$. So this
seems like encouraging evidence towards an answer of "yes, we can determine $\mu(A)$ in general," but it turns out the
Cantor set is actually pretty well-behaved compared to some other subsets of $[0, 1]$. In fact, we've actually been asking
the wrong question, because this function isn't even well-defined for all subsets, as the next proposition shows.
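The recursion for the Cantor set can be traced numerically. The sketch below builds the intervals of $C_n$ with exact rational arithmetic and sums their lengths; the function names are hypothetical, and intervals are stored simply as endpoint pairs.

```python
from fractions import Fraction


def cantor_step(intervals):
    """Remove the middle third of each interval, given as (left, right) pairs."""
    result = []
    for a, b in intervals:
        third = (b - a) / 3
        result.append((a, a + third))
        result.append((b - third, b))
    return result


def cantor_level(n):
    """Return the 2^n intervals that make up C_n."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        intervals = cantor_step(intervals)
    return intervals


def total_length(intervals):
    """Sum of the lengths, i.e. the measure assigned by the interval property."""
    return sum(b - a for a, b in intervals)


for n in range(6):
    level = cantor_level(n)
    print(n, len(level), total_length(level))
```

Each step doubles the number of intervals and multiplies the total length by $\frac{2}{3}$, matching $\mu(C_n) = (2/3)^n$.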


Proposition 6

There is no function $\mu : \mathcal{P}([0, 1]) \to [0, 1]$ that satisfies all three of the properties we want in a measure.

Proof (the Vitali construction). Define an equivalence relation on the real numbers by setting $x \sim y$ if $x - y \in \mathbb{Q}$.
This partitions the real line into equivalence classes (cosets of the form $[x] = x + \mathbb{Q}$), where each one is a shift of $\mathbb{Q}$ by
some real number. Similarly to Problem 2, we pick a representative $z$ from each equivalence class, and for simplicity,
we can pick all $z$ to be in $[0, 1]$ (since a number is always in the same class as its fractional part).

Let $A$ be the set of all representatives, which is a subset of $[0, 1]$ (this is sometimes called the Vitali set), and let
$A_q = A + q$ be $A$ translated by $q$ units. Every real number in $[0, 1]$ differs from its representative by a rational number,
so it is in $A_q$ for some rational number $q$, and the sets $A_q$ are disjoint for distinct rational $q$ because each class has
exactly one representative. Now for any rational $q \in [0, 1]$, we have $A_q \subseteq [0, 2]$, so we can write

$$A_q = (A_q \cap [0, 1)) \sqcup (A_q \cap [1, 2]) \implies \mu(A_q) = \mu(A_q \cap [0, 1)) + \mu(A_q \cap [1, 2]) = \mu(A_q \cap [0, 1)) + \mu(A_{q-1} \cap [0, 1]).$$

(Basically, think of taking the part of $A_q$ that lies in $[1, 2]$ and translating it to the left by 1 unit.) Now for any rational
number $q \in [0, 1)$, $A_{q+n}$ only intersects the interval $[0, 1]$ if $n = 0$ or $n = -1$. Thus, we can split up the interval $[0, 1]$
into contributions from different rational numbers $q$ and use countable additivity:

$$1 = \mu([0, 1]) = \mu\left( \bigsqcup_{q \in \mathbb{Q} \cap [0, 1)} (A_{q-1} \cap [0, 1]) \sqcup (A_q \cap [0, 1)) \right) = \sum_{q \in \mathbb{Q} \cap [0, 1)} \left( \mu(A_{q-1} \cap [0, 1]) + \mu(A_q \cap [0, 1)) \right) = \sum_{q \in \mathbb{Q} \cap [0, 1)} \mu(A_q).$$

But now all $A_q$s have equal measure by translation invariance, so $1 = \sum_{q \in \mathbb{Q} \cap [0, 1)} \mu(A)$, which is impossible because
1 cannot be the sum of countably many equal numbers (the sum is 0 if $\mu(A) = 0$ and infinite if $\mu(A) > 0$).


The way measure theory deals with this problem is to not require that $\mu$ be defined on all subsets of our space.
As a historical note, this was a pretty surprising idea when it was first proposed, but it's really the only thing we can
do if we try to write formal definitions down. (We don't really want to relax the conditions on our measure, so it's
better to just not define the measure on some pathological subsets.)


Definition 7

A measure space is a triple $(\Omega, \mathcal{F}, \mu)$ satisfying the following axioms:
- $\Omega$, the state space or outcome space, is a nonempty set.
- $\mathcal{F}$, a set of measurable subsets or events, is a $\sigma$-field or $\sigma$-algebra over $\Omega$ (those terms will be used
interchangeably). In other words, $\mathcal{F}$ is a nonempty collection of subsets of $\Omega$ which is closed under
complementation and countable union, so if $A \in \mathcal{F}$, then $A^c = \Omega \setminus A$ is also in $\mathcal{F}$, and if $A_i \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i$ is
also in $\mathcal{F}$.
- $\mu$, the measure, is a map $\mathcal{F} \to [0, \infty]$ which is not infinite everywhere and is countably additive. In other
words, if $A_i \in \mathcal{F}$ are disjoint, then $\mu\left(\bigsqcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$.

The goal of our next two lectures will basically be to construct a uniform measure on $[0, 1]$: we'll define a measure
space $([0, 1], \mathcal{L}, \mu)$, where $\mathcal{L}$ will be defined later, and similarly this will also allow us to define the measure space
$(\mathbb{R}, \mathcal{L}_{\mathbb{R}}, \mu)$.


Definition 8

A measure $\mu$ is a probability measure (which we will denote $P$) if $\mu(\Omega) = 1$.


Just working with the definitions, we can gather a few immediate consequences about measure spaces: 


Proposition 9

Given a measure space $(\Omega, \mathcal{F}, \mu)$, the following must hold:

1. $\emptyset$ and $\Omega$ are in $\mathcal{F}$,

2. $\mu(\emptyset) = 0$,

3. (continuity from below) if $A_i \uparrow A$ (that is, $A_1 \subseteq A_2 \subseteq \cdots$ and $A = \bigcup_i A_i$), then $\mu(A_i) \uparrow \mu(A)$,

4. (continuity from above) if $A_i \downarrow A$ (that is, $A_1 \supseteq A_2 \supseteq \cdots$ and $A = \bigcap_i A_i$) and also $\mu(A_1) < \infty$, then
$\mu(A_i) \downarrow \mu(A)$,

5. for any $\Omega$, $\mathcal{F} = \{\emptyset, \Omega\}$ is a valid $\sigma$-field, and so is $\mathcal{P}(\Omega)$.

Proof. For (1), because $\mathcal{F}$ is nonempty, there is some event $A$ in $\mathcal{F}$. Then $A^c \in \mathcal{F}$, so $A \cup A^c = \Omega$ must be in $\mathcal{F}$, and
so must $\Omega^c = \emptyset$. For (2), there is some $A$ such that $\mu(A) < \infty$, so by additivity $\mu(A \sqcup \emptyset) = \mu(A) = \mu(A) + \mu(\emptyset)$,
meaning $\mu(\emptyset) = 0$.

For (3), we use countable additivity on $A = \bigsqcup_{i=1}^{\infty} B_i$, where $B_1 = A_1$, $B_2 = A_2 \setminus A_1$, $B_3 = A_3 \setminus A_2$, and so on. Then

$$\mu(A) = \sum_{n=1}^{\infty} \mu(B_n),$$

and since $\mu(A_n) = \sum_{i=1}^{n} \mu(B_i)$ are the partial sums of our infinite sum, $\mu(A_i)$ indeed converges to $\mu(A)$. Similarly,
for (4), define $B_1 = \emptyset$, $B_2 = A_1 \setminus A_2$, $B_3 = A_1 \setminus A_3$, and so on. All $B_i$s have finite measure since $\mu(A_1)$ is finite, and
$B_1 \subseteq B_2 \subseteq \cdots$ with union $A_1 \setminus A$, so we can apply (3) on the $B_i$s to find that $\mu(A_1 \setminus A_i) \uparrow \mu(A_1 \setminus A)$, so $\mu(A_i) \downarrow \mu(A)$ by
additivity. (The assumption that $\mu(A_1) < \infty$ is necessary: a counterexample with $\mu(A_1) = \infty$ is $A_i = [i, \infty)$.)

Finally, the two examples in (5) are both $\sigma$-fields because they satisfy both closure axioms (complementation and
countable union).
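On a finite set, the two closure axioms can be checked mechanically, because countable unions reduce to finite ones and it is enough to test pairs. The sketch below is only an illustration of item (5) under that finiteness assumption; `power_set` and `is_sigma_field` are hypothetical helper names.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    return {
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    }


def is_sigma_field(omega, family):
    """Check the sigma-field axioms for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    if not family:
        return False  # a sigma-field must be nonempty
    for a in family:
        if not a <= omega:
            return False  # members must be subsets of omega
        if omega - a not in family:
            return False  # closed under complementation
    for a in family:
        for b in family:
            if a | b not in family:
                return False  # closed under (finite) union
    return True


omega = {1, 2, 3}
print(is_sigma_field(omega, [set(), omega]))          # trivial sigma-field
print(is_sigma_field(omega, power_set(omega)))        # power set
print(is_sigma_field(omega, [set(), {1}, omega]))     # missing the complement {2, 3}
```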


If $\Omega$ is a finite set, and we wanted to define a probability space in 18.600, we would define a probability mass
function $P(\omega) = a_\omega$ for every $\omega \in \Omega$, such that $\sum_{\omega \in \Omega} a_\omega = 1$. In our new notation, $\mathcal{F} = \mathcal{P}(\Omega)$ is the set of all
possible events, and $P(A) = \sum_{\omega \in A} a_\omega$ for $A \in \mathcal{F}$. (We can check that this is a valid probability space from the axioms.)

But our discussion above shows that if we have a set like $\Omega = [0, 1]$ or $\mathbb{R}$, and we take $\mathcal{F} = \mathcal{P}(\Omega)$, then we do have a
valid $\sigma$-field, but the Vitali set shows that we can't define a measure $\mu$ with the properties we want on it. So next week,
we'll basically ask how to restrict our set $\mathcal{F}$ to get a useful measure.
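For a finite $\Omega$, the claim that a probability mass function gives a valid probability space on the power set can be verified exhaustively. The following sketch does so for a three-point space; the weights in `pmf` are made up for illustration and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical probability mass function on a three-point outcome space.
pmf = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}


def prob(event, pmf):
    """P(A) = sum of the point masses a_w over w in A."""
    return sum((pmf[w] for w in event), Fraction(0))


def all_events(pmf):
    """The power set of the outcome space."""
    items = list(pmf)
    return [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]


def is_additive(pmf):
    """Check P(A u B) = P(A) + P(B) for every pair of disjoint events."""
    events = all_events(pmf)
    return all(
        prob(a | b, pmf) == prob(a, pmf) + prob(b, pmf)
        for a in events
        for b in events
        if not (a & b)
    )


print(prob(frozenset(pmf), pmf))   # total mass of the whole space
print(prob({"a", "c"}, pmf))
print(is_additive(pmf))
```

Because the space is finite, finite additivity is all that countable additivity asks for here.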


Definition 10

The Borel $\sigma$-field $\mathcal{B}_{\mathbb{R}}$ is the smallest $\sigma$-field over $\mathbb{R}$ that contains $\mathcal{I}$, the set of all open intervals on $\mathbb{R}$.

Importantly, whenever we see the word "smallest" in a definition, we should ask whether we have a well-defined
object (since there can be multiple "minimal" objects in general). But in this case, if $\{\mathcal{F}_\alpha : \alpha \in I\}$ is a collection
of $\sigma$-fields over $\Omega$, then the intersection of the $\mathcal{F}_\alpha$s is also a $\sigma$-field. (Indeed, if a set $A$ is in the intersection of some
$\sigma$-fields, then both $A$ and $A^c$ are in all of the $\sigma$-fields, so $A^c$ is in the intersection. The same argument works for
countable union.) So we can just let $\mathcal{B}_{\mathbb{R}}$ be the intersection of all $\sigma$-fields containing $\mathcal{I}$ (there is at least one,
namely $\mathcal{P}(\mathbb{R})$), and thus the Borel $\sigma$-field is well-defined.
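The "smallest $\sigma$-field containing a collection" can be computed explicitly when the underlying set is finite. In that setting (and only there) one can build it from below by repeatedly adding complements and unions until nothing changes, which gives the same family as intersecting all $\sigma$-fields that contain the generators. The function name below is a hypothetical helper, and the generators are arbitrary examples.

```python
def generate_sigma_field(omega, generators):
    """Smallest sigma-field over a finite omega containing the generators."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        for a in current:
            # Candidates forced by the axioms: the complement and pairwise unions.
            candidates = [omega - a] + [a | b for b in current]
            for c in candidates:
                if c not in family:
                    family.add(c)
                    changed = True
    return family


omega = {1, 2, 3, 4}
small = generate_sigma_field(omega, [{1}])
large = generate_sigma_field(omega, [{1, 2}, {2, 3}])
print(sorted(sorted(s) for s in small))
print(len(large))
```

With the single generator $\{1\}$ the result has four sets, while the generators $\{1, 2\}$ and $\{2, 3\}$ already force every subset of $\{1, 2, 3, 4\}$.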
Fact 11

The next two lectures in this class were given by Professor Subhabrata Sen.

Our central question today is how to define the uniform measure on $[0, 1]$. Basically, we will define a function $\mu$
on (some) subsets of $[0, 1]$ with the requirements that
- $\mu((a, b]) = b - a$ (choosing half-open intervals is just a matter of convention),
- given disjoint subsets $\{A_i : i \ge 1\}$, we have $\mu\left(\bigsqcup_i A_i\right) = \sum_i \mu(A_i)$, and
- if we define $A_x = (A + x \bmod 1)$, then $\mu(A_x) = \mu(A)$ (translation invariance).


Vitali's construction tells us that we can't do this with all subsets of $[0, 1]$, or in other words that there is no
function $\mu : 2^{[0,1]} \to [0, 1]$ that satisfies all three conditions above. (Here $2^A$ is the powerset of $A$.) So we have to
compromise and try to define $\mu$ on just a subcollection of the subsets of $[0, 1]$. Last time, we defined a measure space
$(\Omega, \mathcal{F}, \mu)$, consisting of a nonempty state space, a $\sigma$-field, and a measure. We also defined the Borel $\sigma$-field to be
the smallest $\sigma$-field containing the open intervals (which are the sets where we really want the measure to be defined
in a particular way). The half-open intervals generate the same $\sigma$-field, so notationally, this can be written as

$$\mathcal{B} = \sigma(\mathcal{I}), \qquad \mathcal{I} = \{(a, b] \cap \mathbb{R} : a < b\}$$

(allowing $a = -\infty$ or $b = +\infty$, which is why we intersect with $\mathbb{R}$).

With this, we will try to construct the Lebesgue measure on $[0, 1]$. Our hope is that because we already know how
to assign a measure to intervals, we can keep assigning measures to subsets of $[0, 1]$ by building them from intervals.
Unfortunately, this is not rigorous, because there's no closed form for an arbitrary element of $\mathcal{B}_{\mathbb{R}}$, and in fact it's not
true that every set in the Borel $\sigma$-algebra can be written as a countable union and/or intersection of open intervals!

Instead, we do define $\mu((a, b]) = b - a$, and we do define $\mu\left(\bigsqcup_i (a_i, b_i]\right) = \sum_i (b_i - a_i)$ for unions of disjoint intervals
$(a_i, b_i]$, but we're not quite done. We then need to know whether there is a consistent extension of our
function $\mu$ to all sets in $\mathcal{B}_{\mathbb{R}}$. Since this idea will come up again later in the course, we'll make a slight generalization,
defining the function $F(x) = \mu((0, x])$. (So we're primarily working with the case $F(x) = x$, but this works for a
general measure $\mu$.)


Remark 12. Remember that measures in general do not need to be translation-invariant (it's not part of Definition 7);
we just want translation invariance when we are trying to define a uniform measure. In particular, translation invariance
will not hold if $F(x)$ is nonlinear.

From the properties of measures we've been studying, we can notice the following properties of $F$:
- For any $x_1 \ge x_2$, we have $\mu((0, x_1]) \ge \mu((0, x_2]) \implies F(x_1) \ge F(x_2)$, so $F$ must be monotone.
- If $x_n \downarrow x$, then the intervals $(0, x_n] \downarrow (0, x]$, meaning $\mu((0, x_n]) \downarrow \mu((0, x])$ by continuity from above. Thus, we
also have $F(x_n) \downarrow F(x)$. In other words, $F$ is right-continuous.

We'll see soon why the function $F$ is not required to be left-continuous, and we'll give functions $F(x)$ of this form
a name:


Definition 13

A Stieltjes measure function on $\mathbb{R}$ is a function $F : \mathbb{R} \to \mathbb{R}$ which is non-decreasing and right-continuous.

Theorem 14 (Lebesgue-Stieltjes)

For any Stieltjes measure function $F$, there is a unique measure $\mu = \mu_F$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ satisfying $\mu((a, b]) =
F(b) - F(a)$ for all $a, b \in \mathbb{R}$ with $a < b$.

We'll do the proof in two steps (noting that we start off with the definition of $\mu$ on the set $\mathcal{I}$ of half-open intervals $(a, b]$):

1. Show that there exists a unique extension from $\mu : \mathcal{I} \to [0, \infty]$ to $\mu : \mathcal{A} \to [0, \infty]$, where
$$\mathcal{A} = \{A \subseteq \mathbb{R} : A \text{ is a finite disjoint union of sets in } \mathcal{I}\}.$$
(In other words, we define $\mu$ for any finite disjoint union of intervals.)

2. Show that we can extend our measure from being defined on $\mathcal{A}$ to being defined on $\mathcal{B}_{\mathbb{R}}$.


Remark 15. Notice that $\mathcal{A}$ is closed under complements and finite unions (exercise), while the Borel $\sigma$-algebra is
closed under complements and countable unions. So by using $\mathcal{A}$ as an intermediate step, we're not quite at the
$\sigma$-algebra yet, but we're kind of close.


Proof of step 1. The construction itself is pretty intuitive: given any subset $A \in \mathcal{A}$, we can write it as $A = \bigsqcup_{j=1}^{n} E_j$
for some intervals $E_j$ in $\mathcal{I}$, so we define

$$\mu(A) = \sum_{j=1}^{n} \mu(E_j)$$

by additivity. But there are many different ways to write an element of $\mathcal{A}$ as a finite disjoint union of intervals, so we
need to check that

$$A = \bigsqcup_{j=1}^{n} E_j = \bigsqcup_{k=1}^{m} F_k \implies \sum_{j=1}^{n} \mu(E_j) = \sum_{k=1}^{m} \mu(F_k).$$

Indeed, because all $E_j$s and $F_k$s are half-open intervals, we can take the common refinement of the partitions and
notice that

$$\sum_{j=1}^{n} \mu(E_j) = \sum_{j=1}^{n} \mu(E_j \cap A) = \sum_{j=1}^{n} \mu\left(E_j \cap \bigsqcup_{k=1}^{m} F_k\right) = \sum_{j=1}^{n} \sum_{k=1}^{m} \mu(E_j \cap F_k).$$

Importantly, the last equality $\mu\left(E_j \cap \bigsqcup_k F_k\right) = \sum_k \mu(E_j \cap F_k)$ comes not from additivity (because we don't know
that yet) but from the fact that the $F_k$s form a disjoint union of $A$ and thus must cover $E_j$. This can only be done
with a finite set of half-open intervals if they line up back-to-back, and then the lengths (or, for general $F$, the
increments of $F$) will indeed add up to the total for $E_j$. But then starting this calculation from $\sum_k \mu(F_k)$ also yields
this result (after swapping the order of summation, which is okay because we have finite sums), so the two ways of
computing the measure are the same, and $\mu(A)$ is consistently defined. Thus we have indeed extended $\mu$ from $\mathcal{I}$ to $\mathcal{A}$.
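The consistency check in step 1 can be tried on a concrete set. The sketch below represents a finite disjoint union of half-open intervals $(a, b]$ as a list of endpoint pairs, computes $\sum (F(b) - F(a))$ for two different decompositions of the same set, and also for their common refinement. The choice $F(x) = x^3$ and the specific intervals are assumptions made purely for illustration.

```python
from fractions import Fraction as Fr


def measure(intervals, F):
    """Sum of F(b) - F(a) over a finite disjoint union of intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)


def common_refinement(es, fs):
    """All nonempty intersections E_j n F_k, each again a half-open interval."""
    pieces = []
    for a, b in es:
        for c, d in fs:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                pieces.append((lo, hi))
    return pieces


def identity(x):
    return x          # uniform case F(x) = x


def cube(x):
    return x**3       # a non-decreasing, continuous (hence right-continuous) F


# Two ways of writing the same set (0, 1] u (2, 3].
es = [(Fr(0), Fr(1, 2)), (Fr(1, 2), Fr(1)), (Fr(2), Fr(5, 2)), (Fr(5, 2), Fr(3))]
fs = [(Fr(0), Fr(1, 3)), (Fr(1, 3), Fr(1)), (Fr(2), Fr(3))]

for F in (identity, cube):
    print(measure(es, F), measure(fs, F), measure(common_refinement(es, fs), F))
```

All three numbers on each line agree, because the increments of $F$ telescope along intervals that line up back-to-back.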


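We will now record a few properties of the measure which will be useful for next lecture:

Proposition 16

Our measure $\mu = \mu_F$ has the following properties on $\mathcal{A}$:

1. $\mu$ is monotone and finitely additive over $\mathcal{A}$, meaning that for any disjoint $A_1, \dots, A_n \in \mathcal{A}$, $\mu\left(\bigsqcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \mu(A_i)$.

2. $\mu$ is finitely subadditive over $\mathcal{A}$, meaning that for any $A_1, \dots, A_n \in \mathcal{A}$, $\mu\left(\bigcup_{i=1}^{n} A_i\right) \le \sum_{i=1}^{n} \mu(A_i)$.

3. $\mu$ is countably subadditive over $\mathcal{I}$: if $A, A_i \in \mathcal{I}$ (where $i$ ranges over the positive integers) and $A \subseteq \bigcup_i A_i$,
then $\mu(A) \le \sum_i \mu(A_i)$.

4. $\mu$ is countably additive on $\mathcal{A}$ (for countable disjoint unions that are themselves in $\mathcal{A}$).

Property (4) is really what we care about most, because we want countable additivity on the final space $\mathcal{B}_{\mathbb{R}}$ that
we're defining $\mu_F$ on.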


Proof. It suffices to show (1) and (2) for $n = 2$ by induction. For (1), write $A_1 = E_1 \sqcup \cdots \sqcup E_n$ and $A_2 = F_1 \sqcup \cdots \sqcup F_m$,
where all $E_i$s and $F_j$s are half-open intervals. By definition of $\mu$ on $\mathcal{A}$, we then have

$$\mu(A_1 \sqcup A_2) = \sum_{i=1}^{n} \mu(E_i) + \sum_{j=1}^{m} \mu(F_j)$$

because the $E_i$s and $F_j$s are all disjoint intervals, and then the two sums on the right-hand side are $\mu(A_1)$ and $\mu(A_2)$,
proving additivity. Monotonicity then also follows because all terms here are nonnegative (lengths of intervals), so if
$A \subseteq B$, then $\mu(B) = \mu(A) + \mu(B \setminus A) \ge \mu(A)$. For (2), write $A_1 \cup A_2$ as the disjoint union of $A_1$ and $A_2 \setminus A_1$, so then
by additivity and monotonicity from (1) and using the fact that $A_2 \setminus A_1 \subseteq A_2$,

$$\mu(A_1 \cup A_2) = \mu(A_1 \sqcup (A_2 \setminus A_1)) = \mu(A_1) + \mu(A_2 \setminus A_1) \le \mu(A_1) + \mu(A_2).$$


For (3), we're working over $\mathcal{I}$, so write $A = (a, b]$ and $A_i = (a_i, b_i]$, and we're given that $(a, b] \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i]$. (By
definition, we have $\mu(A) = F(b) - F(a)$ and $\mu(A_i) = F(b_i) - F(a_i)$.) Fix $\varepsilon > 0$. Since $F$ is right-continuous, there exist real
numbers $x > a$ and $c_i > b_i$ such that $F(x) - F(a) < \varepsilon$ and $F(c_i) - F(b_i) < \frac{\varepsilon}{2^i}$. (Intuitively, this now lets us work with $A$
as a closed interval and the $A_i$s as open intervals.) We still have a covering

$$[x, b] \subseteq \bigcup_{i=1}^{\infty} (a_i, c_i)$$

because we've only made $A$ smaller and the $A_i$s bigger. Since $[x, b]$ is compact, and the right hand side is an open
covering, there exists a finite subcover by the Heine-Borel theorem, which we will denote (with some abuse of notation)

$$[x, b] \subseteq \bigcup_{i=1}^{n} (a_i, c_i).$$

Now by (1) and (2), because $\mu$ is finitely subadditive and monotone,

$$\mu((a, b]) \le \mu((a, x]) + \sum_{i=1}^{n} \mu((a_i, c_i]) = (F(x) - F(a)) + \sum_{i=1}^{n} (F(c_i) - F(a_i))$$

$$= (F(x) - F(a)) + \sum_{i=1}^{n} (F(c_i) - F(b_i)) + \sum_{i=1}^{n} (F(b_i) - F(a_i)) \le 2\varepsilon + \sum_{i=1}^{\infty} (F(b_i) - F(a_i))$$

(one $\varepsilon$ factor comes from $F(x) - F(a)$, and the other comes from the sum of the geometric series $\sum_i \frac{\varepsilon}{2^i}$). But now
taking $\varepsilon \to 0$, we indeed get countable subadditivity

$$\mu((a, b]) \le \sum_{i=1}^{\infty} \mu((a_i, b_i])$$
as desired. Finally, for (4), we are trying to prove that given $A, A_i \in \mathcal{A}$, where $A = \bigsqcup_{i=1}^{\infty} A_i$, we have $\mu(A) =
\sum_{i=1}^{\infty} \mu(A_i)$. Write

$$A = \bigsqcup_{i=1}^{n} A_i \sqcup \left( \bigsqcup_{i=n+1}^{\infty} A_i \right) = \bigsqcup_{i=1}^{n} A_i \sqcup C_{n+1},$$

where $C_{n+1} = \bigsqcup_{i=n+1}^{\infty} A_i$ is actually a finite disjoint union of intervals as well (because it is the intervals of $A$, with the
intervals in $A_1, \dots, A_n$ removed). Thus, we can apply (1) to say that

$$\mu(A) = \sum_{i=1}^{n} \mu(A_i) + \mu(C_{n+1}) \ge \sum_{i=1}^{n} \mu(A_i).$$

Taking $n$ to infinity, we find that

$$\mu(A) \ge \sum_{i=1}^{\infty} \mu(A_i),$$

which gives one direction of the inequality. To show the upper bound, without loss of generality, we can assume that
all $A_i \in \mathcal{I}$, because each $A_i$ is originally a finite disjoint union of intervals (so the whole disjoint union is a countable
disjoint union of intervals). Also, because $A$ is an element of $\mathcal{A}$, we can separately write $A = \bigsqcup_{j=1}^{m} E_j$ with all $E_j \in \mathcal{I}$.
So now because $A$ is entirely contained in $\bigsqcup_i A_i$, we may write

$$\mu(A) = \sum_{j=1}^{m} \mu(E_j) = \sum_{j=1}^{m} \mu\left(E_j \cap \left(\bigsqcup_i A_i\right)\right) = \sum_{j=1}^{m} \mu\left(\bigsqcup_i (E_j \cap A_i)\right).$$

Since $E_j$ and $A_i$ are both intervals, so is $E_j \cap A_i$, and thus by (3) we have

$$\mu(A) \le \sum_{j=1}^{m} \sum_{i=1}^{\infty} \mu(E_j \cap A_i) = \sum_{i=1}^{\infty} \sum_{j=1}^{m} \mu(E_j \cap A_i)$$

by swapping the order of summation (okay because we have a nonnegative sum), and finally by (1) this yields

$$\mu(A) \le \sum_{i=1}^{\infty} \mu(A \cap A_i) = \sum_{i=1}^{\infty} \mu(A_i),$$

proving the other direction of the inequality and verifying (4).


Our next step is to extend $\mu$ from $\mathcal{A}$ to the actual $\sigma$-field $\mathcal{B}_{\mathbb{R}}$. For now, let's assume the measure we want to
define is a finite measure (meaning that it never takes on the value $\infty$). We'll show the following result next time:

Theorem 17 (Carathéodory extension theorem)

Let $\mathcal{A}$ be an algebra over $\Omega$ (meaning that it is closed under complements and finite unions), and let $\mu : \mathcal{A} \to
[0, \infty)$ be a countably additive finite measure. Then there exists a unique measure $\tilde{\mu} : \sigma(\mathcal{A}) \to [0, \infty)$ on the
generated $\sigma$-algebra which extends $\mu$.

Recall that the central object we're studying is the Borel $\sigma$-field $\mathcal{B}_{\mathbb{R}}$, which is the smallest $\sigma$-algebra containing the
intervals $\mathcal{I} = \{(a, b] \cap \mathbb{R} : a < b\}$. Last time, we started working towards a proof that given any Stieltjes measure
function (monotone, right-continuous) $F$, there exists a unique $\mu_F$ on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ such that $\mu_F((a, b]) = F(b) - F(a)$. In
particular, we extended the measure from $\mathcal{I}$ to the algebra $\mathcal{A}$ consisting of finite disjoint unions of intervals in $\mathcal{I}$, and
today, we'll further extend this to $\mathcal{B}_{\mathbb{R}} = \sigma(\mathcal{A})$.


Remark 18. A quick follow-up from last lecture: if we define $F(x) = \mu((0, x])$, then it is not true that $F$ needs to
be left-continuous. After all, when $x_n \uparrow x$ with every $x_n < x$, we have $(0, x_n] \uparrow (0, x)$, while $F(x)$ uses the interval $(0, x]$ instead:

$$\mu((0, x_n]) \uparrow \mu((0, x)) \ne \mu((0, x]) = F(x) \text{ in general}.$$

The example to keep in mind is that if we have a step function $F(x)$ with a point mass, then $(0, x)$ and $(0, x]$ differ
by the mass at point $x$, but this is a valid Stieltjes measure function.
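To see the point-mass example numerically, the sketch below uses a hypothetical Stieltjes measure function with a jump of size 1 at $x = \frac{1}{2}$ (the function and the test points are assumptions chosen for illustration) and evaluates $\mu((a, b]) = F(b) - F(a)$ on intervals approaching $\frac{1}{2}$ from the left.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def F(x):
    """Hypothetical F: slope 1 everywhere plus a unit jump at x = 1/2.

    The jump is included at x = 1/2 itself, which makes F right-continuous.
    """
    return x + (1 if x >= HALF else 0)


def mu(a, b):
    """Measure of the half-open interval (a, b]."""
    return F(b) - F(a)


for n in (4, 10, 100, 1000):
    x_n = HALF - Fraction(1, n)
    # mu((0, x_n]) stays below 1/2, while (x_n, 1/2] always carries the atom.
    print(n, mu(0, x_n), mu(x_n, HALF))

print("mu((0, 1/2]) =", mu(0, HALF))
```

The values $\mu((0, x_n])$ increase to $\frac{1}{2} = \mu((0, \frac{1}{2}))$, but $\mu((0, \frac{1}{2}]) = \frac{3}{2}$, and the difference is exactly the mass sitting at the point $\frac{1}{2}$.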


The extension of the measure $\mu_F$ (and thus one way to construct a uniform measure) will be done by proving the
Carathéodory extension theorem, Theorem 17, today. We'll go through the proof in a slightly nonlinear way; we want
to prove (1) that an extension exists and (2) that it is unique, but we'll do the latter first by introducing the $\pi$-$\lambda$
theorem.


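Definition 19

A $\pi$-system $\mathcal{P}$ on a set $\Omega$ is a nonempty collection of subsets of $\Omega$, such that for any $A, B \in \mathcal{P}$, $A \cap B \in \mathcal{P}$.

For example, the collection of half-open intervals $\mathcal{I}$ is a $\pi$-system.

Definition 20

A $\lambda$-system $\mathcal{L}$ on a set $\Omega$ is a collection of subsets of $\Omega$ satisfying the following conditions:
- $\Omega \in \mathcal{L}$,
- For any $A, B \in \mathcal{L}$ where $B \subseteq A$, $A \setminus B \in \mathcal{L}$.
- If we have a sequence $A_n \in \mathcal{L}$, and $A_n \uparrow A$, then $A \in \mathcal{L}$.

All $\sigma$-algebras are both $\pi$-systems (closure under intersections follows from closure under complements and unions,
since $A \cap B = (A^c \cup B^c)^c$) and $\lambda$-systems (in any $\sigma$-algebra, we have $\Omega$ by taking any $A \cup A^c$, we have $A \setminus B$ because
it is also $A \cap B^c$, and we have $A = \bigcup_n A_n$ whenever $A_n \uparrow A$ by countable union). And the converse also holds: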


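Proposition 21

If a $\lambda$-system $\mathcal{L}$ is also a $\pi$-system, then $\mathcal{L}$ is a $\sigma$-algebra.

Proof. Because $\mathcal{L}$ contains the whole set $\Omega$ (by definition of a $\lambda$-system), it is nonempty. And because $\Omega \in \mathcal{L}$,
$\Omega \setminus A = A^c \in \mathcal{L}$ for any $A \in \mathcal{L}$, so it is also closed under complementation. Finally, for countable union, if $\mathcal{L}$ contains
$A_1, A_2$, then it also contains $A_1 \cup A_2$ (the complement of $A_1^c \cap A_2^c$), and more generally it contains $A_1 \cup \cdots \cup A_n$ for
any $n$. These finite unions increase to $\bigcup_n A_n$, so $\mathcal{L}$ must also contain the limit $\bigcup_n A_n$.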
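The $\pi$-system hypothesis in Proposition 21 cannot be dropped. A standard small example, which is not from the lecture, is the family of even-sized subsets of $\{1, 2, 3, 4\}$: it is a $\lambda$-system but is not closed under intersections. The sketch below checks this by brute force; on a finite set every increasing sequence is eventually constant, so the increasing-limit condition holds automatically and only the first two conditions need testing.

```python
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})

# All subsets of OMEGA with an even number of elements (sizes 0, 2, 4).
EVEN = {
    frozenset(c)
    for r in (0, 2, 4)
    for c in combinations(sorted(OMEGA), r)
}


def is_lambda_system(omega, family):
    """Check the lambda-system axioms on a finite set."""
    if omega not in family:
        return False
    for a in family:
        for b in family:
            if b <= a and (a - b) not in family:
                return False  # not closed under proper differences
    # Increasing limits: automatic here, since increasing sequences stabilise.
    return True


def is_pi_system(family):
    """Nonempty and closed under pairwise intersection."""
    return bool(family) and all((a & b) in family for a in family for b in family)


print(is_lambda_system(OMEGA, EVEN))  # lambda-system axioms hold
print(is_pi_system(EVEN))             # fails: {1, 2} n {2, 3} = {2}
```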


In other words, verifying the definition of a $\sigma$-algebra can be split into the intersections part ($\pi$-system) and
the rest ($\lambda$-system), and this turns out to be very helpful:


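Theorem 22 (Dynkin's $\pi$-$\lambda$ theorem)

If $\mathcal{P}$ is a $\pi$-system, $\mathcal{L}$ is a $\lambda$-system, and $\mathcal{P} \subseteq \mathcal{L}$, then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.

This theorem will have applications in various different contexts, so we should try to understand how it works.

Proof. Without loss of generality, we can let $\mathcal{L}$ be the smallest $\lambda$-system containing $\mathcal{P}$ (this only makes $\sigma(\mathcal{P}) \subseteq \mathcal{L}$
harder to achieve). This is well-defined, because there is at least one $\lambda$-system containing $\mathcal{P}$ (the powerset of $\Omega$), and
intersections $\mathcal{L}_1 \cap \mathcal{L}_2$ of $\lambda$-systems are $\lambda$-systems as well (the same argument works for arbitrary intersections):
- $\Omega$ is in both $\mathcal{L}_1$ and $\mathcal{L}_2$, so it's also in $\mathcal{L}_1 \cap \mathcal{L}_2$.
- If $A$ and $B$ (with $B \subseteq A$) are both in $\mathcal{L}_1 \cap \mathcal{L}_2$, they're both in $\mathcal{L}_1$ and $\mathcal{L}_2$, so $A \setminus B$ is in both $\lambda$-systems, meaning it's also in
their intersection.
- If all $A_n$s (with $A_n \uparrow A$) are in $\mathcal{L}_1 \cap \mathcal{L}_2$, then they're in both $\lambda$-systems, so $A$ is in both $\lambda$-systems and therefore
the intersection as well.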


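Thus, the smallest $\mathcal{L}$ is the intersection of all $\lambda$-systems containing $\mathcal{P}$. It suffices to show that $\mathcal{L}$ is also a $\pi$-system,
because that would mean $\mathcal{L}$ is a $\sigma$-field (by Proposition 21) containing $\mathcal{P}$, so it must contain $\sigma(\mathcal{P})$, the smallest $\sigma$-field
containing $\mathcal{P}$. In other words, we need to show that given any $A, B \in \mathcal{L}$, $A \cap B \in \mathcal{L}$. Fix any $A \in \mathcal{P}$, and define

$$\mathcal{L}_A = \{B \in \mathcal{L} : A \cap B \in \mathcal{L}\} \subseteq \mathcal{L},$$

the subset of $\mathcal{L}$ whose intersections with $A$ are also in $\mathcal{L}$. We can check that $\mathcal{L}_A$ is a $\lambda$-system:
- $A \cap \Omega = A \in \mathcal{L}$, so $\Omega \in \mathcal{L}_A$.
- If $B_2 \subseteq B_1$ are in $\mathcal{L}_A$, meaning $A \cap B_1, A \cap B_2 \in \mathcal{L}$, then $(A \cap B_1) \setminus (A \cap B_2) = A \cap (B_1 \setminus B_2)$ is also in $\mathcal{L}$, so
$B_1 \setminus B_2$ is also in $\mathcal{L}_A$.
- If we have an increasing sequence $B_n \uparrow B$ with all $B_n \in \mathcal{L}_A$ (that is, $A \cap B_n \in \mathcal{L}$ for all $n$), then $B \in \mathcal{L}$ and $A \cap B \in \mathcal{L}$
because $(A \cap B_n) \uparrow A \cap B$, so $B$ is also in $\mathcal{L}_A$.

In addition, because $A \in \mathcal{P}$ and $\mathcal{P}$ is a $\pi$-system, any element $B \in \mathcal{P}$ also has $A \cap B \in \mathcal{P} \subseteq \mathcal{L}$. Thus $\mathcal{L}_A$ contains
$\mathcal{P}$. But $\mathcal{L}$ is the smallest $\lambda$-system containing $\mathcal{P}$, so $\mathcal{L} \subseteq \mathcal{L}_A \subseteq \mathcal{L}$ and the two sets are equal. Therefore, for all $A \in \mathcal{P}$
and $B \in \mathcal{L}$, $A \cap B \in \mathcal{L}$. Now fix any $B \in \mathcal{L}$, and define

$$\mathcal{L}_B = \{A \in \mathcal{L} : A \cap B \in \mathcal{L}\}.$$

By the argument we just made, $\mathcal{P} \subseteq \mathcal{L}_B$ (any intersection between an element of $\mathcal{P}$ and an element of $\mathcal{L}$ is in $\mathcal{L}$). In addition, the
exact same argument as for $\mathcal{L}_A$ shows that $\mathcal{L}_B$ must be a $\lambda$-system, so $\mathcal{L}_B = \mathcal{L}$. So the intersection of any two
elements of $\mathcal{L}$ is in $\mathcal{L}$, meaning that $\mathcal{L}$ is a $\pi$-system, which is what sufficed to prove our result.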
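On a finite set, the statement of the $\pi$-$\lambda$ theorem can be observed directly by computing both closures from below. This is only a finite illustration under that assumption (increasing limits are automatic there), not a substitute for the proof; the helper names and the example families are hypothetical.

```python
def lambda_closure(omega, family):
    """Smallest lambda-system over a finite omega containing the family."""
    omega = frozenset(omega)
    result = {omega} | {frozenset(s) for s in family}
    changed = True
    while changed:
        changed = False
        for a in list(result):
            for b in list(result):
                if b <= a and (a - b) not in result:
                    result.add(a - b)  # forced by closure under proper differences
                    changed = True
    return result


def sigma_closure(omega, family):
    """Smallest sigma-field over a finite omega containing the family."""
    omega = frozenset(omega)
    result = {frozenset(), omega} | {frozenset(s) for s in family}
    changed = True
    while changed:
        changed = False
        for a in list(result):
            for c in [omega - a] + [a | b for b in list(result)]:
                if c not in result:
                    result.add(c)
                    changed = True
    return result


omega = {1, 2, 3, 4}
pi_family = [{1, 2}, {2}]          # closed under intersection
non_pi_family = [{1, 2}, {2, 3}]   # {1, 2} n {2, 3} = {2} is missing

print(lambda_closure(omega, pi_family) == sigma_closure(omega, pi_family))
print(len(lambda_closure(omega, non_pi_family)), len(sigma_closure(omega, non_pi_family)))
```

For the $\pi$-system the two closures coincide, as the theorem predicts. For the family that is not a $\pi$-system, the smallest $\lambda$-system is strictly smaller than the generated $\sigma$-field, so the hypothesis is doing real work.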
The important takeaway from this proof is that measure-theoretic proofs are difficult because we don't have a
closed form for all measurable subsets. Instead, the right way is to first verify that a certain subset $\mathcal{P}$ has the property
we want, and then this theorem is powerful because it can bootstrap results from $\mathcal{P}$ to its $\sigma$-algebra. In particular, we
can now prove the uniqueness part of the Carathéodory extension theorem:


Proof of uniqueness for Theorem 17. Suppose that there were two distinct extensions $\mu_1$ and $\mu_2$ of our measure
from $\mathcal{A}$ to $\sigma(\mathcal{A})$. Then $\mu(A) = \mu_1(A) = \mu_2(A)$ for all $A \in \mathcal{A}$ by definition, and we can define

$$\mathcal{L} = \{B \in \sigma(\mathcal{A}) : \mu_1(B) = \mu_2(B)\}$$

to be the subsets in $\sigma(\mathcal{A})$ where the extensions agree. (We want to show that $\mathcal{L}$ is the $\sigma$-algebra $\sigma(\mathcal{A})$ itself.) We've
established that $\mathcal{A} \subseteq \mathcal{L}$, and we know that $\mathcal{A}$ is a $\pi$-system (intersections of finite collections of intervals are a finite
collection of intervals). Now $\mathcal{L}$ is a $\lambda$-system, as we can verify directly:

- The whole space $\Omega$ is in $\mathcal{L}$, because an algebra always contains $A \cup A^c = \Omega$.

Remark 23. There are actually some small technicalities here depending on the space $\Omega$ we choose. For example,
$[0, 1]$ is not a finite union of half-open intervals because of the closed lower bound, so it's easiest to define a
uniform measure on the real line first and restrict to $[0, 1]$ afterward. But we're also assuming that our measure
only takes on finite values, so the way we actually need to set up our construction is to make this argument on
(for example) $(-n, n]$ for each positive integer $n$ and take the union over all $n$.

- If $\mu_1(A) = \mu_2(A)$, $\mu_1(B) = \mu_2(B)$, and $B \subseteq A$, then $\mu_1(A \setminus B) = \mu_2(A \setminus B)$ by additivity (subtracting is allowed
because the measures are finite), so $A \setminus B$ is also in $\mathcal{L}$.

- If $\mu_1(A_n) = \mu_2(A_n)$ for all $n$ in an increasing sequence $\{A_n\}$ with $A_n \uparrow A$, we can define $B_1 = A_1$, $B_2 = A_2 \setminus A_1$, and so on.
From the previous point, $\mu_1(B_i) = \mu_2(B_i)$ for all $i$, so by countable additivity we have $\mu_1\left(\bigsqcup B_i\right) = \mu_2\left(\bigsqcup B_i\right)$.
Since $\mu_1\left(\bigsqcup B_i\right) = \mu_1\left(\bigcup A_i\right) = \mu_1(A)$ (and same for $\mu_2$), $\mu_1(A) = \mu_2(A)$ as desired.

Thus the $\pi$-$\lambda$ theorem tells us that $\sigma(\mathcal{A}) \subseteq \mathcal{L}$, so $\sigma(\mathcal{A}) = \mathcal{L}$, meaning $\mu_1 = \mu_2$ and the measures are the same.
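It is worth seeing why the uniqueness argument needs the generating class to be a $\pi$-system. The sketch below gives a small example, which is not from the lecture, of two different probability measures on $\{1, 2, 3, 4\}$ that agree on $\{1, 2\}$, $\{2, 3\}$, and the whole space, yet disagree on the intersection $\{2\}$. The weights and helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = (1, 2, 3, 4)

# Two hypothetical probability measures given by their point masses.
mu1 = {w: Fraction(1, 4) for w in OMEGA}
mu2 = {1: Fraction(0), 2: Fraction(1, 2), 3: Fraction(0), 4: Fraction(1, 2)}


def measure(weights, event):
    """Measure of an event as the sum of its point masses."""
    return sum((weights[w] for w in event), Fraction(0))


def agreement_class():
    """All subsets B of OMEGA with mu1(B) == mu2(B)."""
    return {
        frozenset(c)
        for r in range(len(OMEGA) + 1)
        for c in combinations(OMEGA, r)
        if measure(mu1, c) == measure(mu2, c)
    }


generators = [frozenset({1, 2}), frozenset({2, 3})]  # not a pi-system
for g in generators:
    print(sorted(g), measure(mu1, g), measure(mu2, g))
print("on {2}:", measure(mu1, {2}), measure(mu2, {2}))
print("sets where the measures agree:", len(agreement_class()))
```

The class where the two measures agree is a $\lambda$-system containing the generators, but it has only 6 of the 16 subsets, even though the generators produce every subset as their $\sigma$-field.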


This proof technique is quite powerful: we only had to verify that $\mu_1(A) = \mu_2(A)$ for a small subset of all
measurable subsets, and $\mathcal{L}$ helped us bootstrap that to a more general statement (even though there isn't an easy
way to write down what an arbitrary element of $\sigma(\mathcal{A})$ even looks like). Now, we'll actually construct a measure that
works:


Proof of existence for Theorem 17. We begin with a definition:

Definition 24

An outer measure on $\Omega$ is a map $\nu : 2^{\Omega} \to [0, \infty)$ which is monotone and countably subadditive.

An outer measure is basically a "crude" measure which has certain desirable characteristics of a measure but not
all of them, and the example we'll use is the following:


Definition 25 
Let $\mu$ be a measure on an algebra $\mathcal{A}$. Then for all $E \subseteq \Omega$, define

$$\mu^*(E) = \inf\Big\{ \sum_{i=1}^\infty \mu(A_i) : A_i \in \mathcal{A},\ E \subseteq \bigcup_{i=1}^\infty A_i \Big\}.$$

In other words, $\mu^*(E)$ is the infimum of the total measure we need to cover $E$ using countably many sets in
$\mathcal{A}$. We can verify that $\mu^*$ is indeed an outer measure:

• It is monotone, because any covering of $B \supseteq A$ also covers $A$, meaning that $\mu^*(B) \ge \mu^*(A)$.
• It is also countably subadditive, because the union of coverings of the $A_i$ also covers $\bigcup_i A_i$, and we're
still using a countable number of sets in $\mathcal{A}$ to do so.


Lemma 26 


The outer measure $\mu^*$ extends $\mu$ on $\mathcal{A}$ (the set of finite disjoint unions of half-open intervals). In other
words, $\mu^*(A) = \mu(A)$ for all $A \in \mathcal{A}$.


Proof. First of all, we know that $\mu^*(A) \le \mu(A)$, since $A$ covers itself. Suppose for the sake of contradiction that
$\mu^*(A) < \mu(A)$. Then there is some countable union of half-open intervals that covers $A$ with total measure less than
$\mu(A)$ (since the infimum over all coverings is less than $\mu(A)$), and we can choose the covering so that each interval
in the covering is contained within one of the intervals of $A$ (if an interval of the covering meets several intervals
of $A$, split it up, and then cut off the parts beyond the endpoints, which only decreases the total measure of the
covering). Since $A$ is made up of finitely many intervals (say $N$ of them), by the pigeonhole principle, there is some
interval $I$ of $A$ covered by intervals $I_n \in \mathcal{I}$ such that $\sum_n \mu^*(I_n) < \mu(I)$. Pick the interval $I$
with the largest deviation, and we know that

$$\sum_n \mu^*(I_n) \le \mu(I) - \varepsilon$$

for some $\varepsilon > 0$. However, $\mu^*(I_n) = \mu(I_n)$ (this can be checked very similarly to part (3) of Proposition 16,
covering a compact interval with an open covering), and then again by part (3) of Proposition 16 we have subadditivity
of $\mu$ over intervals, meaning

$$\mu(I) \le \sum_n \mu(I_n),$$

a contradiction. Thus we cannot have $\mu^*(A) < \mu(A)$, and $\mu^*(A) = \mu(A)$ as desired.


In other words, the outer measure assigns a value to every subset of our space $\Omega$, and it does so correctly on
$\mathcal{A}$. We've already seen that doing this assignment naively won't give us a valid measure, so to refine this
argument, we need to find a suitable subcollection $\mathcal{A}^* \subseteq 2^\Omega$ such that $\mathcal{A}^*$ is a
$\sigma$-algebra and the restriction $\mu^*|_{\mathcal{A}^*}$ is a measure. If we can find such a set, then
$\mathcal{A} \subseteq \mathcal{A}^*$ implies that $\sigma(\mathcal{A}) \subseteq \sigma(\mathcal{A}^*) = \mathcal{A}^*$, which means we
will have successfully defined our extension on $\sigma(\mathcal{A})$.


Definition 27 
A subset $E \subseteq \Omega$ is measurable with respect to $\mu^*$ if for all $F \subseteq \Omega$, we have

$$\mu^*(F) = \mu^*(F \cap E) + \mu^*(F \cap E^c).$$

Measurability will turn out to be the "niceness" property that we want, and the set on which we will define our
measure is

$$\mathcal{A}^* = \{E \subseteq \Omega : E \text{ is measurable with respect to } \mu^*\}.$$
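To get a feel for Definition 27, here is a minimal sketch on a toy finite space. Everything in it is an assumption made for illustration and not part of the lecture: the space `OMEGA = {0, 1, 2, 3}`, the algebra generated by the blocks `{0, 1}`, `{2}`, `{3}`, and the masses assigned to them. Because the algebra is finite, any countable cover can be merged into a single set of the algebra, so the infimum in the definition of the outer measure becomes a minimum over single covering sets.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical toy setup: Omega = {0, 1, 2, 3}; the algebra consists of all
# unions of the blocks below, and each block carries the given mass.
OMEGA = frozenset({0, 1, 2, 3})
BLOCKS = [
    (frozenset({0, 1}), Fraction(1, 2)),
    (frozenset({2}), Fraction(1, 4)),
    (frozenset({3}), Fraction(1, 4)),
]


def powerset(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]


# Build the algebra: every union of blocks, with measure the sum of the masses.
ALGEBRA = {}
for r in range(len(BLOCKS) + 1):
    for chosen in combinations(BLOCKS, r):
        union = frozenset().union(*(block for block, _ in chosen))
        ALGEBRA[union] = sum((mass for _, mass in chosen), Fraction(0))


def outer_measure(E):
    """mu*(E): the cheapest set of the algebra that covers E."""
    return min(mass for A, mass in ALGEBRA.items() if E <= A)


def is_measurable(E):
    """Caratheodory criterion: mu*(F) = mu*(F & E) + mu*(F - E) for every F."""
    return all(
        outer_measure(F) == outer_measure(F & E) + outer_measure(F - E)
        for F in powerset(OMEGA)
    )


measurable_sets = {E for E in powerset(OMEGA) if is_measurable(E)}
print(sorted(sorted(E) for E in measurable_sets))
```

In this toy case the measurable sets are exactly the eight sets of the algebra. A set such as `{0}`, which splits the block `{0, 1}`, fails the criterion with the test set `F = {0, 1}`, since both halves have outer measure 1/2 while `F` itself has outer measure 1/2.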


First of all, we need to make sure that we are working with a large enough subcollection: 


Lemma 28 


With the definition of $\mathcal{A}^*$ above (and the algebra $\mathcal{A}$ in the theorem statement), $\mathcal{A} \subseteq \mathcal{A}^*$.
Proof of lemma. We wish to show that any element of $\mathcal{A}$ is measurable with respect to $\mu^*$, or in other words
that for any $A$ in $\mathcal{A}$ and any $F \subseteq \Omega$, we have

$$\mu^*(F) = \mu^*(F \cap A) + \mu^*(F \cap A^c).$$

One direction is clear: we have

$$\mu^*(F) \le \mu^*(F \cap A) + \mu^*(F \cap A^c)$$

because the union of the coverings of $F \cap A$ and $F \cap A^c$ always covers $F$. Now, suppose $F$ is covered by some
countable union of intervals $I_1, I_2, \ldots$. We will show that we can cover $F \cap A$ and $F \cap A^c$ with at most
$4\varepsilon$ more measure for any $\varepsilon > 0$ (which will prove that the sum of infima on the right-hand side is at
most $4\varepsilon$ more than the infimum on the left-hand side).

Since $A$ is a finite collection of intervals, each interval $I_k$ in the covering of $F$ can only have finitely many
transitions between the parts contained in $A$ and $A^c$, so there are a countable number of changes between $A$ and
$A^c$ overall. Break up the intervals at those changes, which doesn't change the total $\mu$ of the covering. We can now
cover $F \cap A$ and $F \cap A^c$ as follows: $F \cap A$ is a subset of $(\bigcup_k I_k) \cap A$ (since the $I_k$s cover $F$),
which is a countable union of intervals (each possibly closed or open). Similarly, $F \cap A^c$ is a subset of
$(\bigcup_k I_k) \cap A^c$, which is another countable union of intervals. Extend both sides of the $i$th interval in
$(\bigcup_k I_k) \cap A$ (which has countably many intervals) by $\varepsilon / 2^i$, and similarly extend both sides of the
$j$th interval in $(\bigcup_k I_k) \cap A^c$ by $\varepsilon / 2^j$. (Also, use this extra length to turn all of the intervals
into half-open ones.) We've then used at most $4\varepsilon$ more $\mu$ than in our covering of $F$, but we've covered both
$F \cap A$ and $F \cap A^c$.

Because this argument works for any covering of $F$,

$$\mu^*(F) + 4\varepsilon \ge \mu^*(F \cap A) + \mu^*(F \cap A^c),$$

and taking $\varepsilon \to 0$ shows the other direction of the inequality, completing the proof.


To finish our construction, we just need to show that $\mathcal{A}^*$ is a $\sigma$-algebra and that $\mu^*$ is countably
additive on $\mathcal{A}^*$. First of all, $\mathcal{A}^*$ is closed under complementation, because $E$ and $E^c$ are symmetric
in Definition 27 (so if $E$ satisfies the condition, so does $E^c$). Towards showing closure under countable union, we
will first prove that it's closed under finite union. By induction, it suffices to show that if
$E_1, E_2 \in \mathcal{A}^*$, then $E_1 \cup E_2 \in \mathcal{A}^*$. By definition, this is equivalent to plugging $E_1 \cup E_2$
into Definition 27, and it suffices to just show the inequality

$$\mu^*(F) \ge \mu^*(F \cap (E_1 \cup E_2)) + \mu^*(F \cap (E_1 \cup E_2)^c),$$

because the reverse inequality is true by subadditivity of $\mu^*$ (one of the properties of outer measure). To show
this, note that

$$\mu^*(F \cap (E_1 \cup E_2)) \le \mu^*(F \cap E_1) + \mu^*((F \setminus E_1) \cap E_2)$$

by subadditivity (the two sets on the right-hand side are disjoint and their union is $F \cap (E_1 \cup E_2)$), and we
also have

$$\mu^*((F \setminus E_1) \cap E_2) = \mu^*(F \cap E_1^c) - \mu^*(F \setminus (E_1 \cup E_2))$$

by the measurability of $E_2$ applied to the set $F \cap E_1^c = F \setminus E_1$. Adding the two relations, the
$\mu^*((F \setminus E_1) \cap E_2)$ terms cancel, so

$$\mu^*(F \cap (E_1 \cup E_2)) + \mu^*(F \setminus (E_1 \cup E_2)) \le \mu^*(F \cap E_1) + \mu^*(F \cap E_1^c) = \mu^*(F),$$


where the last step comes from $E_1$ being measurable. So $\mathcal{A}^*$ is indeed closed under finite unions. To prove
that $\mathcal{A}^*$ is also closed under countable unions, we need to show that for any $E_i \in \mathcal{A}^*$, we also have
$\bigcup_{i=1}^\infty E_i \in \mathcal{A}^*$. First of all, for any disjoint $E_1, E_2 \in \mathcal{A}^*$, by the measurability
of $E_1$ applied to the set $F \cap (E_1 \cup E_2)$, we have (for any set $F$)

$$\mu^*(F \cap (E_1 \cup E_2)) = \mu^*(F \cap (E_1 \cup E_2) \cap E_1) + \mu^*((F \cap (E_1 \cup E_2)) \setminus E_1) = \mu^*(F \cap E_1) + \mu^*(F \cap E_2),$$

and thus by induction we see that

$$\boxed{\mu^*\Big(F \cap \Big(\bigcup_{i=1}^n E_i\Big)\Big) = \sum_{i=1}^n \mu^*(F \cap E_i)}.$$

To now actually prove our claim, first assume without loss of generality that the $E_i$s are disjoint: because we
already have closure under complementation and finite unions (hence finite intersections), we can instead use the sets
$M_1 = E_1$, $M_2 = E_2 \setminus E_1 = E_2 \cap E_1^c$, and so on, without changing the overall union of the $E_i$s. Now
define

$$A_n = \bigcup_{i=1}^n E_i, \qquad A = \bigcup_{i=1}^\infty E_i.$$


Since we have just proved closure for finite unions, we know that $A_n \in \mathcal{A}^*$ for all $n$, and therefore

$$\mu^*(F) = \mu^*(F \cap A_n) + \mu^*(F \setminus A_n)$$

by the definition of measurability. So now applying the boxed equality to the first term and monotonicity to the second
term (since $F \setminus A \subseteq F \setminus A_n$), we have

$$\mu^*(F) \ge \sum_{i=1}^n \mu^*(F \cap E_i) + \mu^*(F \setminus A),$$

and taking $n \to \infty$ yields

$$\mu^*(F) \ge \sum_{i=1}^\infty \mu^*(F \cap E_i) + \mu^*(F \setminus A) \ge \mu^*(F \cap A) + \mu^*(F \setminus A)$$

by countable subadditivity of $\mu^*$. The reverse inequality is true by subadditivity of outer measure, so we've verified
the measurability of the countable union $A$, meaning that $\mathcal{A}^*$ is indeed a $\sigma$-algebra.
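Finally, to prove that $\mu^*$ is indeed a measure on $\mathcal{A}^*$, we need to show countable additivity, and we can do
this by showing inequalities in both directions. For any disjoint measurable sets $A_i$, we have

$$\mu^*\Big(\bigcup_{i=1}^\infty A_i\Big) \le \sum_{i=1}^\infty \mu^*(A_i)$$

by countable subadditivity of the outer measure $\mu^*$. In the other direction, monotonicity gives, for each $n$,

$$\mu^*\Big(\bigcup_{i=1}^\infty A_i\Big) \ge \mu^*\Big(\bigcup_{i=1}^n A_i\Big) = \sum_{i=1}^n \mu^*(A_i),$$

where in the last step we apply measurability of $A_n$ to the set $\bigcup_{i=1}^n A_i$, then apply measurability of
$A_{n-1}$ to the set $\bigcup_{i=1}^{n-1} A_i$, and so on. Taking $n \to \infty$, we obtain the other direction of inequality,
proving countable additivity of $\mu^*$. Since $\mu^*$ extends $\mu$ and is defined on the $\sigma$-algebra $\mathcal{A}^*$
containing $\mathcal{A}$, we have successfully defined a measure on $\sigma(\mathcal{A})$.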


Putting the proofs of existence and uniqueness together, we have finally extended our measure to $\sigma(\mathcal{I})$, as
desired. And in particular, using the Stieltjes measure function $F(x) = x$, we have finally (after two lectures of
work) successfully defined the uniform (Lebesgue) measure on $[0, 1]$.
4 September 16, 2019


Our first homework assignment will be posted later today (after lecture); it will be due in class next Wednesday. There 
will be four or five problem sets in this class, and this first one will have us work a little more with properties of 
measures. 

We'll start with a restatement of the material covered in the last two lectures. We proved Theorem 14, which
states that for any nondecreasing right-continuous function $F : \mathbb{R} \to \mathbb{R}$, we have a unique measure $\mu_F$
on $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ satisfying $\mu_F((a, b]) = F(b) - F(a)$. The proof has two parts: first, we extend
$\mu$ from $\mathcal{I}$ (the set of half-open intervals of the form $(a, b]$) to $\mathcal{A}$ (the set of finite disjoint
unions of such intervals). Then, we extend $\mu$ from $\mathcal{A}$ to the Borel $\sigma$-algebra it generates using the
Carathéodory extension theorem; uniqueness follows from the $\pi$-$\lambda$ theorem, and existence comes from the
construction of the outer measure.

This is a class about probability, so we'll start to introduce random variables and their properties today. We'll see 
soon that these variables are specific examples of measurable functions, and that their properties (like expected value 


and variance) are specific examples of Lebesgue integrals. 


Definition 29 


A probability space is a measure space $(\Omega, \mathcal{F}, \mu)$ such that $\mu(\Omega) = 1$.

We will often denote $\mu$ as $\mathbb{P}$ for a probability space. A way to interpret this definition is that $\Omega$ is
a space of possible outcomes, $\mathcal{F}$ is the set of possible events that can be assigned measures (some sets of
outcomes are not "well-behaved" enough to be assigned a probability measure), and $\mathbb{P}$ is the measure itself.


Example 30 


If we have a sequence of $n$ independent fair coin tosses, we can write down its probability space as

$$\Omega = \{\text{heads}, \text{tails}\}^n = \{0,1\}^n, \qquad \mathcal{F} = \mathcal{P}(\Omega), \qquad \mathbb{P}(A) = \frac{|A|}{2^n} \quad \forall A \subseteq \Omega.$$

In this case, the set of possible events $\mathcal{F}$ can just be the full powerset of $\Omega$ because we have a finite
set. And the way we've defined the space, our probability measure is uniform on the $2^n$ possible sequences: for any
individual outcome (that is, any event of size $|A| = 1$), we have $\mathbb{P}(A) = \frac{1}{2^n}$.

In order to study properties of more general probability spaces, we'll need to be more abstract. From here, let
$(\Omega, \mathcal{F}, \mu)$ be any measure space, in particular allowing for $\mu(\Omega) = \infty$.


Definition 31 
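The indicator function for the set $A \in \mathcal{F}$, denoted $\mathbf{1}_A(\omega)$ or $\mathbf{1}\{\omega \in A\}$, is defined
to be

$$\mathbf{1}_A(\omega) = \begin{cases} 1 & \omega \in A, \\ 0 & \omega \in \Omega \setminus A. \end{cases}$$

Definition 32

A simple function $f : \Omega \to \mathbb{R}$ is a function that can be written in the form
$f = \sum_{i=1}^n c_i \mathbf{1}_{A_i}$, where $c_i \in \mathbb{R}$ and the $A_i$ are disjoint sets in $\mathcal{F}$.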
Definition 33

A measurable function $f$ is one that can be pointwise approximated by simple functions, meaning that there
exists a sequence of simple functions $\{f_n\}$ such that $f(\omega) = \lim_{n \to \infty} f_n(\omega)$ for all $\omega \in \Omega$.


(One way to phrase this is that the “nicely-behaved” functions are those which can be built up from indicator 


functions using pointwise limits of linear combinations.) 


Definition 34 
Let $(\Omega, \mathcal{F})$ and $(S, \mathcal{G})$ be two measurable spaces. A measurable mapping is a function
$f : \Omega \to S$ such that for every $G \in \mathcal{G}$, the preimage $f^{-1}(G)$ is an element of $\mathcal{F}$.

In other words, if we take any "well-behaved" set in $S$, its preimage is also well-behaved in $\Omega$.

Proposition 35
A measurable function $f$ on $(\Omega, \mathcal{F})$ (as defined in Definition 33) is equivalent to a measurable mapping
$(\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ as defined in Definition 34.


Proof. This is an exercise left for our homework (we must show that being measurable in each definition also means 


being measurable in the other sense). 


The definition of a measurable function is more concrete, while the mapping is a bit more abstract, but the latter
is particularly useful because a measure $\mu$ on the domain space $\Omega$ naturally induces a measure on the target
space $S$:

Proposition 36
If $\mu$ is a measure on $(\Omega, \mathcal{F})$, and we have a measurable mapping $f : (\Omega, \mathcal{F}) \to (S, \mathcal{G})$,
then we have a measure $\nu$ on $(S, \mathcal{G})$ called the pushforward measure, defined as

$$\nu(G) = \mu(f^{-1}(G)) \quad \forall G \in \mathcal{G}.$$

This is typically written as $\nu = f_*\mu$ or $f_\#\mu$.


(Soon in this lecture, we'll connect this abstract idea to the idea of a “distribution” from undergraduate probability. ) 


Proposition 37 


If $f, g$ are measurable, then $af + bg$ is measurable for all $a, b \in \mathbb{R}$, and so is $f \cdot g$.

Proof sketch. Write $f$ and $g$ as pointwise limits of simple functions and do some bookkeeping. Alternatively, we can
check using the other definition of measurability, verifying that the preimages of sets of the form $(a, b]$ (it
suffices to check $(a, \infty)$) are measurable.


Definition 38 


A random variable is a (real-valued) measurable function defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.
(We often use the variables $X$ or $Y$ rather than $f$.) More generally, an $(S, \mathcal{G})$-valued random variable is a
measurable mapping $(\Omega, \mathcal{F}) \to (S, \mathcal{G})$.

Random variables will usually be real-valued in this class, but we may occasionally specify a different target space.

Fact 39 (Helpful notation)
Since $X$ is a function from $\Omega$ to $\mathbb{R}$, we will often consider the pre-image
$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$ of a set of real numbers $B$. We will denote this set $\{X \in B\}$, so
for example, $\{X > 5\} = \{\omega \in \Omega : X(\omega) > 5\}$.


To see why this is good notation, remember that $X$ is a mapping from some space $(\Omega, \mathcal{F}, \mathbb{P})$ to the
real line $(\mathbb{R}, \mathcal{B})$, so if we take the pushforward measure $X_*\mathbb{P}$, we get the distribution or law of
the random variable $X$, often denoted $\mathcal{L}_X$. We'll go through the symbol-pushing to make sure we understand how
it behaves: $\mathcal{L}_X$ is a measure on our target space $(\mathbb{R}, \mathcal{B})$, meaning that it takes in a measurable
subset $B \in \mathcal{B}$ and outputs

$$\mathcal{L}_X(B) = (X_*\mathbb{P})(B) = \mathbb{P}(X^{-1}(B)) = \mathbb{P}(\{\omega \in \Omega : X(\omega) \in B\}),$$

the "probability" of $X$ mapping to $B$. So measure theory gives a more rigorous meaning to $\mathbb{P}(X \in B)$, which is
notation that we like to use.


Example 40 


Looking back at the coin-tossing example, where we have $\mathcal{F} = \mathcal{P}(\Omega)$ and $\mathbb{P}$ a uniform measure
on $\Omega$, consider the two random variables $X(\omega) = \omega_1 = \mathbf{1}\{\text{first toss is heads}\}$ and
$Y(\omega) = \sum_{i=1}^n \omega_i = $ number of heads.

We can visualize $\Omega$ as the vertices of a hypercube $\{0,1\}^n$, where the map $Y$ sends a vertex
$(x_1, \ldots, x_n)$ to $x_1 + \cdots + x_n$. (So random variables often "condense information" and are not one-to-one.)
Then the law $\mathcal{L}_Y$ is a measure on $(\mathbb{R}, \mathcal{B})$ supported on $\{0, 1, \ldots, n\}$, so we just need to
find the weight assigned to each integer to describe the distribution. In formal notation, we have

$$\mathcal{L}_Y(\{k\}) = \mathbb{P}(Y^{-1}(\{k\})) = \mathbb{P}(\{\omega \in \{0,1\}^n : Y(\omega) = k\}) = \frac{\binom{n}{k}}{2^n},$$

and this is the familiar binomial distribution with parameters $(n, \frac{1}{2})$.
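The pushforward construction in this example can be checked by brute force. The following is a minimal sketch, not anything from the lecture: the helper name `pushforward` and the choice `n = 4` are assumptions made for illustration. It enumerates the hypercube, sums the mass of each preimage $Y^{-1}(\{k\})$, and compares the result with the binomial weights.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product
from math import comb

n = 4  # number of tosses (an arbitrary illustrative choice)

# Omega = {0,1}^n with the uniform measure P({omega}) = 1 / 2^n.
omega_space = list(product((0, 1), repeat=n))
P = {omega: Fraction(1, 2 ** n) for omega in omega_space}


def Y(omega):
    """Number of heads in the outcome omega."""
    return sum(omega)


def pushforward(measure, f):
    """Law of f: each value s receives the mass of the preimage f^{-1}({s})."""
    law = defaultdict(Fraction)
    for omega, mass in measure.items():
        law[f(omega)] += mass
    return dict(law)


law_Y = pushforward(P, Y)
for k in sorted(law_Y):
    # Second column: pushforward weight; third column: binomial weight.
    print(k, law_Y[k], Fraction(comb(n, k), 2 ** n))
```

The two columns agree for every `k`, which is the statement that the law of `Y` is binomial with parameters `(n, 1/2)`.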

Now that we have a more explicit description of our random variables, we'll move on to Lebesgue integration.
For this part, assume that $(\Omega, \mathcal{F}, \mu)$ is a $\sigma$-finite measure space, meaning that there exists a
sequence $\Omega_n \uparrow \Omega$ such that $\mu(\Omega_n) < \infty$ for all $n$. (For example, the real line with the (uniform)
Lebesgue measure would work by taking $\Omega_n = [-n, n]$.) Suppose that we have a measurable function
$f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$; our goal is to define the quantity

$$\int f\, d\mu = \int f(\omega)\, d\mu(\omega),$$

such that in the case where $\mu$ is a probability measure $\mathbb{P}$, we get the mean of $f$. Since we build up measurable
functions from simple functions, it makes sense to do the same for integrals:

1. For any simple function $f = \sum_{i=1}^n c_i \mathbf{1}_{A_i}$, define

$$\int f\, d\mu = \sum_{i=1}^n c_i\, \mu(A_i).$$


It's bookkeeping to check that this integral doesn't depend on the representation of $f$ as a simple function.
Specifically, if we can write $f$ as $\sum_{i=1}^n c_i \mathbf{1}_{A_i} = \sum_{j=1}^m d_j \mathbf{1}_{B_j}$ (discarding any terms
with zero coefficient), then each $A_i$ must be contained within the region where $f$ is nonzero, which is contained
within $B_1 \cup \cdots \cup B_m$. Thus,

$$\sum_{i=1}^n c_i\, \mu(A_i) = \sum_{i=1}^n c_i\, \mu\Big(\bigcup_{j=1}^m (A_i \cap B_j)\Big) = \sum_{i=1}^n \sum_{j=1}^m c_i\, \mu(A_i \cap B_j).$$

Similarly, $\sum_{j=1}^m d_j\, \mu(B_j) = \sum_{i=1}^n \sum_{j=1}^m d_j\, \mu(A_i \cap B_j)$. But because the $A_i$s are disjoint
and so are the $B_j$s, we must have $c_i = d_j$ (equal to the value that $f$ takes in the region) whenever
$A_i \cap B_j \neq \varnothing$. Thus the two expressions are indeed equal.
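The bookkeeping above can be tried out on a small example. This is only a sketch under assumptions made for illustration: a finite space with hypothetical point masses `weights`, and two hand-picked representations `rep_a` and `rep_b` of the same simple function (value 3 on `{0, 1, 2}`, value -1 on `{3}`, value 0 on `{4}`).

```python
from fractions import Fraction

# Hypothetical finite measure space: point masses on Omega = {0, 1, 2, 3, 4}.
weights = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6),
           3: Fraction(2), 4: Fraction(5, 4)}


def mu(A):
    """Measure of a subset A of Omega."""
    return sum((weights[w] for w in A), Fraction(0))


def integral(rep):
    """Integral of sum_i c_i 1_{A_i}, where rep = [(c_i, A_i), ...] with disjoint A_i."""
    return sum((c * mu(A) for c, A in rep), Fraction(0))


def evaluate(rep, w):
    """Value of the simple function at the point w."""
    return sum(c for c, A in rep if w in A)


# Two different representations of the same function.
rep_a = [(3, {0, 1}), (3, {2}), (-1, {3})]
rep_b = [(3, {0, 1, 2}), (-1, {3}), (0, {4})]

# The common refinement used in the proof: sum of c_i * mu(A_i & B_j).
refined = sum((c * mu(A & B) for c, A in rep_a for _, B in rep_b), Fraction(0))

print(integral(rep_a), integral(rep_b), refined)
```

All three printed values coincide, since the refinement sum can be regrouped either by the sets of `rep_a` or by the sets of `rep_b`.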


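2. Next, we define the integral for bounded measurable functions. For any measurable function
$f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$ with $\sup_{\omega \in \Omega} |f(\omega)| < \infty$ and
$\mu(\{f \neq 0\}) < \infty$ (the latter meaning that the function is supported on a set of finite measure), define

$$\int f\, d\mu = \sup\Big\{ \int \varphi\, d\mu : \varphi \le f,\ \varphi \text{ simple function} \Big\} =: S.$$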


It may seem arbitrary to approach f from below, so we might ask whether it also makes sense to approximate from 


above. It turns out both methods yield the same answer: 


Proposition 41 


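Let $I = \inf\big\{ \int \varphi\, d\mu : \varphi \ge f,\ \varphi \text{ simple function} \big\}$ and take the definition of $S$
from above. Then $I = S$.

Proof. We clearly have $S \le I$, because any $\varphi \le f$ in the definition of $S$ is pointwise smaller than any
$\varphi \ge f$ in the definition of $I$ (and thus the integral $\int \varphi\, d\mu$ is also smaller). To show that $I \le S$,
we will sandwich our function between simple functions. For each positive integer $m$, we have

$$f^{\text{lower}} \le f \le f^{\text{upper}}, \quad \text{where} \quad f^{\text{upper}} = \frac{\lfloor m f \rfloor + 1}{m} \cdot \mathbf{1}\{f \neq 0\}, \quad f^{\text{lower}} = \frac{\lceil m f \rceil - 1}{m} \cdot \mathbf{1}\{f \neq 0\},$$

where the idea is that $f^{\text{upper}}$ and $f^{\text{lower}}$ shift the function $f$ in increments of $\frac{1}{m}$, and the
$\mathbf{1}\{f \neq 0\}$ term ensures that both of these functions are zero whenever $f = 0$. Additionally, if
$\sup |f| = M$ (finite by assumption here), then $f^{\text{upper}} \le M + 1$ and $f^{\text{lower}} \ge -M - 1$. So both
$f^{\text{lower}}$ and $f^{\text{upper}}$ are bounded and only take on values that are integer multiples of $\frac{1}{m}$, so
they are simple. We can thus write

$$I - S \le \int f^{\text{upper}}\, d\mu - \int f^{\text{lower}}\, d\mu \le \frac{2}{m}\, \mu(\{f \neq 0\}),$$

because the two integrands differ by at most $\frac{2}{m}$ and only on the region (of finite measure) where $f \neq 0$.
Taking $m$ to infinity, we find that $I - S \le 0$, establishing the other inequality and proving that $I = S$.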
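The sandwich argument can be seen numerically. The sketch below is an illustration under assumptions not made in the lecture: it takes $f(x) = x^2$ on $[0, 1]$ with Lebesgue measure, for which the level set $\{k/m < f < (k+1)/m\}$ is an interval of length $\sqrt{(k+1)/m} - \sqrt{k/m}$. On that level set the lower simple function takes the value $k/m$ and the upper one takes $(k+1)/m$ (points where $m f$ is an integer form a set of measure zero and are ignored).

```python
from math import sqrt


def level_set_length(k, m):
    """Lebesgue measure of {x in [0, 1] : k/m < x^2 < (k+1)/m}."""
    return sqrt((k + 1) / m) - sqrt(k / m)


def lower_upper_integrals(m):
    """Integrals of the lower and upper simple functions with step 1/m."""
    lower = sum(k / m * level_set_length(k, m) for k in range(m))
    upper = sum((k + 1) / m * level_set_length(k, m) for k in range(m))
    return lower, upper


for m in (10, 100, 1000):
    lower, upper = lower_upper_integrals(m)
    # The gap upper - lower equals 1/m here, within the 2/m bound of the proof.
    print(m, lower, upper, upper - lower)
```

Both columns squeeze toward 1/3, the value of the integral of $x^2$ over $[0, 1]$. Note that the sums are organized by slicing the range of $f$ into steps of size $1/m$, rather than by slicing the domain.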


3. Next, we define the integral on nonnegative (but not necessarily bounded) measurable functions $f$ by approximating
with bounded functions from below:

$$\int f\, d\mu = \sup\Big\{ \int h\, d\mu : 0 \le h \le f,\ h \text{ bounded with } \mu(\{h \neq 0\}) < \infty \Big\}.$$

Proposition 42

The above definition is equivalent to setting $\int f\, d\mu = \lim_{m \to \infty} \int_{\Omega_m} \min(f, m)\, d\mu$, where we
integrate over sets $\Omega_m \uparrow \Omega$ of finite measure (meaning $\mu(\Omega_m) < \infty$ for each $m$).

This gives us a more explicit formula for the integral, since we don't need to actually compute the supremum over
all possible $h$ and can just take the "truncated" functions $\min(f, m)$.
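Proof. Let $I_m = \int_{\Omega_m} \min\{f, m\}\, d\mu$ and $I = \int f\, d\mu$; we wish to show that $I_m \uparrow I$. The sequence
$\{I_m\}$ is increasing as a function of $m$, because we're integrating over a larger set and integrating a larger
nonnegative function as $m$ grows. And we know that $I \ge \lim_{m \to \infty} I_m$, because each $I_m$ is the integral of a
function $h = \min\{f, m\} \cdot \mathbf{1}_{\Omega_m}$ of the kind allowed in the definition above, so we just need to show the
other inequality.

By definition, for any $\varepsilon > 0$, we can find a bounded function $h$ (with $0 \le h \le f$ and
$\mu(\{h \neq 0\}) < \infty$) such that $\int_\Omega h\, d\mu \ge I - \varepsilon$. Because $h$ is bounded, we can find some $L$ such
that $|h| \le L$, and then for all $m \ge L$ (and thus $m \ge h$ everywhere),

$$\int_\Omega h\, d\mu = \int_\Omega \min\{h, m\}\, d\mu = \int_{\Omega_m} \min\{h, m\}\, d\mu + \int_{\Omega \setminus \Omega_m} \min\{h, m\}\, d\mu \le I_m + L \cdot \mu\big((\Omega \setminus \Omega_m) \cap \{h \neq 0\}\big),$$

where we use that $h \le f$ in the first term and $h \le L$ in the second. Now $L$ is a constant, and as $m \to \infty$,
$\Omega \setminus \Omega_m \downarrow \varnothing$, so $(\Omega \setminus \Omega_m) \cap \{h \neq 0\} \downarrow \varnothing$. Also, the measure of
any set $(\Omega \setminus \Omega_m) \cap \{h \neq 0\}$ is finite, because $\mu(\{h \neq 0\})$ is finite by definition. So by part
(4) of Proposition 9, $\mu((\Omega \setminus \Omega_m) \cap \{h \neq 0\})$ gets arbitrarily small, and in particular for all
sufficiently large $m$ we have

$$\int_\Omega h\, d\mu \le I_m + \varepsilon.$$

Putting our inequalities together, we thus find that for any $\varepsilon > 0$ and all sufficiently large $m$,

$$I - \varepsilon \le \int_\Omega h\, d\mu \le I_m + \varepsilon,$$

so taking $m \to \infty$ and then $\varepsilon \to 0$ yields $I \le \lim_{m \to \infty} I_m$, giving us the other direction of the
inequality and thus $I_m \uparrow I$.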
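As a concrete illustration of the truncation formula (an example chosen here, not one from the lecture), take $\Omega = (0, 1]$ with Lebesgue measure, $\Omega_m = \Omega$ for every $m$, and the unbounded function $f(x) = x^{-1/2}$. Since $x^{-1/2} \ge m$ exactly when $x \le m^{-2}$, the truncated integrals can be computed in closed form:

```latex
I_m = \int_{(0,1]} \min\{x^{-1/2}, m\}\, dx
    = \underbrace{m \cdot m^{-2}}_{\text{region } x \le m^{-2}}
      + \int_{m^{-2}}^{1} x^{-1/2}\, dx
    = \frac{1}{m} + \Big(2 - \frac{2}{m}\Big)
    = 2 - \frac{1}{m} \;\uparrow\; 2 .
```

So the integral of $f$ is 2, matching the improper Riemann integral of $x^{-1/2}$ over $(0, 1]$.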


4. For the last step, we want to extend from nonnegative measurable functions to all measurable functions
$f : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B})$. To do this, set

$$f = f_+ - f_-, \quad \text{where } f_+ = \max(f, 0), \quad f_- = \max(-f, 0)$$

(we can check that $f_+$ and $f_-$ are both measurable if $f$ is measurable) and define

$$\int f\, d\mu = \int f_+\, d\mu - \int f_-\, d\mu,$$

where we compute the two integrals on the right-hand side using the previous step. There are a few cases here:
if both integrals are finite, then $\int f\, d\mu$ is finite (in fact, $\int |f|\, d\mu < \infty$, and we say that $f$ is
integrable). Also, if only one of the integrals is infinite, then we get $\infty$ or $-\infty$. But if we have
$\infty - \infty$ on the right-hand side, we say that $\int f\, d\mu$ is undefined.


Definition 43 


The expectation of a random variable $X$ is

$$\mathbb{E}[X] = \int_\Omega X(\omega)\, d\mathbb{P}(\omega) = \int X\, d\mathbb{P}.$$


Defining this integral is useful because any function f : (R, B) > (R, B) that is Riemann-integrable is measurable, 
and the Riemann integral is the same as the Lebesgue integral with respect to the Lebesgue measure. On the 


other hand, the function $\mathbf{1}_{\mathbb{Q}}$ is not Riemann integrable (because it is very "spiky"), but the Lebesgue
integral is perfectly well-defined: the Lebesgue measure of a countable set is 0, so we have
$\int_{\mathbb{R}} \mathbf{1}_{\mathbb{Q}}\, d\mu = 0$. Additionally, we can also interpret discrete sums as Lebesgue integrals by
saying that

$$\sum_{i=1}^\infty f(i) = \int_{\mathbb{R}} f\, d\eta,$$

where $\eta$ is the counting measure $\eta(A) = |A \cap \mathbb{N}|$. So Lebesgue integration can cover both very "smooth" and
very discrete cases, and even some in between (which we'll see on the homework), which is useful because we have both
discrete and continuous random variables in probability. And finally, we'll also soon see that Lebesgue integrals have
nice convergence properties, which will be important for many of the results we show in this class!


5 September 18, 2019 


Last time, we defined the Lebesgue integral. There are some problems about measurability on our homework (we 
should redownload the document because some typos have been fixed), so we'll start with a review of last lecture. 
(The homework is due next Wednesday during class, and Professor Sun says that it takes a while, so we should not 
leave it until the last minute.) 

Recall that a function $f$ is simple if it can be written as a linear combination $f = \sum_{i=1}^n c_i \mathbf{1}_{A_i}$ of
indicator functions (where the $A_i$s may have infinite measure). A measurable function is then the pointwise limit of
simple functions (remember that on our homework, we show that this is equivalent to having $f^{-1}(B) \in \mathcal{F}$ for
all $B \in \mathcal{B}_{\mathbb{R}}$). So to define the Lebesgue integral, we suppose that $(\Omega, \mathcal{F}, \mu)$ is
$\sigma$-finite, meaning that there exist sets $\Omega_m \uparrow \Omega$ with $\mu(\Omega_m) < \infty$ for each $m$. Then we can
define our integral first over simple functions as

$$f = \sum_{i=1}^n c_i \mathbf{1}_{A_i},\ \mu(A_i) < \infty \implies \int_\Omega f\, d\mu = \sum_{i=1}^n c_i\, \mu(A_i),$$

and then use this to define the integral of any nonnegative function $f$:

$$\int f\, d\mu = \sup\Big\{ \int h\, d\mu : 0 \le h \le f,\ h \text{ simple function},\ \mu(\{h > 0\}) < \infty \Big\}.$$

(We did this in two steps last time by defining the integral for bounded functions first, but this definition is
equivalent.) Then we define the integral of any real-valued $f$ to be the integral of $f_+$ (the positive part) minus
the integral of $f_-$ (the negative part), and $\int_\Omega f\, d\mu = \int_\Omega f(\omega)\, d\mu(\omega)$ is then called the Lebesgue
integral of $f$. Importantly, this construction works for any $\sigma$-finite measure, not just the Lebesgue measure.

Last lecture, we also started talking about random variables: given a measurable function $X$ on a probability space
$(\Omega, \mathcal{F}, \mathbb{P})$, we define its law to be the pushforward measure $\mathcal{L}_X = X_*\mathbb{P}$. (This is useful
in connecting the mean of a random variable $\mathbb{E}[X]$ to the Lebesgue integral $\int X\, d\mathbb{P}$.)


Remark 44. We can think of computing the Riemann integral as summing the area of a bunch of rectangles, obtained 
by partitioning the domain into small steps. In contrast, because the Lebesgue integral is built up by simple functions, 
we can think of the Lebesgue integral as summing rectangles coming from partitioning the target space into small 


steps. 


It can be proved that a function $f : [a, b] \to \mathbb{R}$ which is Riemann integrable is also Lebesgue integrable, and
the two integrals agree. (Basically, we can show this for simple functions and then approximate a Riemann integrable
function from above and below with simple functions.) But as we mentioned last time, Lebesgue integration is more
powerful because it can encode discrete sums (which Riemann integration cannot); additionally, we can also define
the Lebesgue integral in more abstract spaces $(\Omega, \mathcal{F}, \mu)$, while the Riemann integral requires some sort of
Euclidean structure. And today, we'll talk about a third advantage of the Lebesgue integral, which is that it is
well-suited for convergence theorems.


Remark 45. In practice, the Lebesgue integral is rarely directly computed from the definition. Instead, we try to write
down more explicit expressions. For example, if we can write $f = h + \mathbf{1}_{\mathbb{Q}}$ for a continuous function $h$,
then we can Riemann integrate $h$ and deal with $\mathbf{1}_{\mathbb{Q}}$ with a discrete summation. And even if the Lebesgue
integral is defined on a more abstract space, we can often push forward onto the real line to get a more explicit
expression, but we often need convergence properties for that.

Convergence theorems answer questions like "if $f_n \to f$ converges pointwise, does $\int f_n\, d\mu \to \int f\, d\mu$?" The
answer turns out to be no in general:


Example 46 


If we define the functions $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$, the integral of each $f_n$ (over the real line) is 1, but the
sequence $f_n$ converges pointwise to 0, so the limit has integral 0.
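A quick check of this counterexample, written as a minimal sketch: it assumes the functions are $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ with an open interval, and the test point `x = 1/50` is an arbitrary choice. Exact rational arithmetic avoids any rounding at the endpoint.

```python
from fractions import Fraction


def f(n, x):
    """f_n = n * 1_{(0, 1/n)} evaluated at the point x."""
    return n if 0 < x < Fraction(1, n) else 0


def integral_f(n):
    """Integral of the simple function f_n: height n times interval length 1/n."""
    return n * Fraction(1, n)


# At any fixed x > 0 the values f_n(x) are eventually 0 (once 1/n <= x) ...
x = Fraction(1, 50)
values = [f(n, x) for n in (1, 10, 49, 50, 51, 1000)]
print(values)

# ... while every integral equals 1.
print({integral_f(n) for n in range(1, 200)})
```

The mass of $f_n$ escapes into an ever taller and thinner spike near 0, so the pointwise limit does not see it.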


Our goal is thus to show conditions under which the integrals do converge. For the rest of this lecture, we'll assume
that $(\Omega, \mathcal{F}, \mu)$ is a $\sigma$-finite measure space, and that $f_n, f, g_n, g$ are all measurable functions on
$(\Omega, \mathcal{F})$. We'll also let $\Omega_k \uparrow \Omega$ be a sequence of sets with $\mu(\Omega_k) < \infty$ for all $k$.


Definition 47 

A sequence of functions $f_n$ converges to $f$ $\mu$-almost everywhere if the numbers $f_n(\omega) \to f(\omega)$
converge except for $\omega$ in a set of $\mu$-measure zero. Similarly, $f_n \to f$ converges in $\mu$-measure if for all
$\varepsilon > 0$, $\lim_{n \to \infty} \mu(\{\omega : |f_n(\omega) - f(\omega)| > \varepsilon\}) = 0$.

It will be a future homework problem to show that pointwise convergence implies $\mu$-almost-everywhere convergence,
which (at least on a space of finite measure) implies convergence in measure, but that the converses are not true. Our
first convergence theorem will be the easiest to prove but unfortunately the least useful:


Theorem 48 (Bounded convergence theorem) 


Suppose that a sequence of measurable functions $f_n$ is uniformly bounded, meaning that there is some $M < \infty$ such
that $|f_n| \le M$ for all $n$. Also suppose that $\mu(\{\omega : f_n(\omega) \neq 0 \text{ for some } n\}) < \infty$. Then if
$f_n \to f$ in $\mu$-measure, we have $\int f_n\, d\mu \to \int f\, d\mu$.


In other words, we have convergence of integrals if the functions are uniformly bounded in value and also in support. 


Proof. Define $E = \{\omega : f_n(\omega) \neq 0 \text{ for some } n\}$. By assumption, notice that the limit $f$ satisfies
$|f| \le M$ everywhere except a set of measure zero, and the set $\{f \neq 0\}$ is contained in $E$ up to a set of measure
zero (because outside of $E$ all of the $f_n$s are zero, so their limit is also zero). Thus, the only contributions to
any of the integrals will come from $E$, and we can ignore $\Omega \setminus E$. Fix some $\varepsilon > 0$, and define
$B_n = \{|f_n - f| > \varepsilon\}$ (this is a subset of $E$). We can now bound the difference between our integrals as

$$\Big| \int_\Omega f_n\, d\mu - \int_\Omega f\, d\mu \Big| = \Big| \int_E (f_n - f)\, d\mu \Big| \le \int_{E \setminus B_n} |f_n - f|\, d\mu + \int_{B_n} |f_n - f|\, d\mu \le \varepsilon\, \mu(E \setminus B_n) + 2M \mu(B_n) \le \varepsilon\, \mu(E) + 2M \mu(B_n),$$

where we use the fact that $|f_n - f| \le 2M$ everywhere and that $|f_n - f| \le \varepsilon$ in $E \setminus B_n$. But
$\mu(B_n) \to 0$ by definition of convergence in $\mu$-measure, so we have

$$\limsup_{n \to \infty} \Big| \int f_n\, d\mu - \int f\, d\mu \Big| \le \varepsilon\, \mu(E).$$

Because $\mu(E)$ is finite, sending $\varepsilon \to 0$ shows that $\int f_n\, d\mu$ does indeed converge to $\int f\, d\mu$, as desired.


Theorem 49 (Fatou’s lemma) 


Let $f_n$ be a sequence of nonnegative functions. Then

$$\int \Big( \liminf_{n \to \infty} f_n \Big)\, d\mu \le \liminf_{n \to \infty} \Big( \int f_n\, d\mu \Big).$$

For this result to make sense, we're assuming that if the $f_n$ are all measurable, then $\liminf f_n$ is also measurable;
that's an exercise we can check ourselves. And as a note, $\liminf_{n \to \infty} f_n$ is only the pointwise liminf of the
$f_n$s $\mu$-almost-everywhere, but that's okay because we're integrating over it anyway and a measure-zero set does not
change the integral.


Proof. Let $f = \liminf_{n \to \infty} f_n$. By definition,

$$\liminf_{n \to \infty} f_n = \sup_{n \ge 1} \Big( \inf_{\ell \ge n} f_\ell \Big)$$

(recall that we can use sup instead of lim because the inner term $\inf_{\ell \ge n} f_\ell$ is nondecreasing with $n$). Let
the inner term be $g_n = \inf_{\ell \ge n} f_\ell$; notice that $0 \le g_n \le f_n$ for any $n$ and $g_n \uparrow f$. Because
$\min\{g_n, k\}$ converges pointwise to $\min\{f, k\}$ on the finite-measure set $\Omega_k$, it also converges in
$\mu$-measure there, so by the bounded convergence theorem (Theorem 48 above) we have

$$\boxed{\int_{\Omega_k} \min\{f, k\}\, d\mu} = \lim_{n \to \infty} \int_{\Omega_k} \min\{g_n, k\}\, d\mu.$$

(Here we are importantly using that $\Omega_k$ has finite measure by definition.) However, we have

$$\lim_{n \to \infty} \int_{\Omega_k} \min\{g_n, k\}\, d\mu \le \lim_{n \to \infty} \int g_n\, d\mu \le \boxed{\liminf_{n \to \infty} \int f_n\, d\mu},$$

where the first step comes from removing the upper cap on the function and also expanding the space $\Omega_k$, and the
second step comes from $g_n \le f_n$. By Proposition 42, if we take $k \to \infty$, $\int_{\Omega_k} \min\{f, k\}\, d\mu$ converges to
$\int f\, d\mu$ (because all functions here are nonnegative), and thus the inequality between the two boxed terms yields the
desired result.

(If we look back at Example 46 and we plug in the functions $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ into Fatou's lemma, we see
that $0 = \int \liminf_{n \to \infty} f_n\, d\mu$ is indeed at most $1 = \liminf_{n \to \infty} \int f_n\, d\mu$.) Fatou's lemma may look
unmotivated, but it is useful because of its applications to some powerful results:
Theorem 50 (Monotone convergence theorem)

Let $f_n \ge 0$ be a sequence of nonnegative functions, and suppose that $f_n \uparrow f$. Then
$\int f_n\, d\mu \uparrow \int f\, d\mu$, where integrals are allowed to be infinite.

Proof. Since the Lebesgue integral is monotone (if $f \le g$, then $\int f\, d\mu \le \int g\, d\mu$), $\int f_n\, d\mu$ is a
nondecreasing sequence of extended real numbers, and it converges to some limit $I \le \int f\, d\mu$ (here using
monotonicity from $f_n \le f$). For the other inequality, Fatou's lemma tells us that

$$\int f\, d\mu = \int \lim_{n \to \infty} f_n\, d\mu \le \liminf_{n \to \infty} \Big( \int f_n\, d\mu \Big) = \lim_{n \to \infty} \Big( \int f_n\, d\mu \Big) = I.$$

Thus the $\int f_n\, d\mu$s indeed increase to $\int f\, d\mu$, as desired.


Theorem 51 (Dominated convergence theorem) 


Let $f_n$ be a sequence of functions, and suppose there is some integrable function $g$ (that is,
$\int |g|\, d\mu < \infty$) such that $|f_n| \le g$ for all $n$. If $f_n \to f$ $\mu$-almost-everywhere, then
$\int f_n\, d\mu \to \int f\, d\mu$.

Proof. By assumption, $-g \le f_n \le g$ for all $n$, so $f_n + g \ge 0$ and $g - f_n \ge 0$. Since $f$ is the limit of the
$f_n$s, it is also their liminf, so by Fatou's lemma,

$$\int (g + f)\, d\mu \le \liminf_{n \to \infty} \int (g + f_n)\, d\mu.$$

Cancelling out the $g$s (which have finite integral) and then rewriting this inequality with both $+f$ and $-f$, we
find that

$$\limsup_{n \to \infty} \int f_n\, d\mu \le \int f\, d\mu \le \liminf_{n \to \infty} \int f_n\, d\mu.$$

Since the liminf is always at most the limsup, both the left and right terms must equal
$\lim_{n \to \infty} \int f_n\, d\mu$, so we have our desired equality.


(With these last two theorems, it’s also useful to see how Example 46 fails each of the theorem assumptions. ) 
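To make that check concrete, here is a short worked computation, assuming (as in Example 46) that $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ on the real line with Lebesgue measure. The test point $3/4$ is an arbitrary choice.

```latex
% Monotone convergence needs f_n \uparrow f, but the sequence is not increasing:
f_1\big(\tfrac{3}{4}\big) = 1 > 0 = f_2\big(\tfrac{3}{4}\big).

% Dominated convergence needs an integrable g with |f_n| \le g for all n.
% Any such g satisfies g \ge \sup_n f_n, and for 0 < x < 1 we have
\sup_{n \ge 1} f_n(x) = \max\{ n : n < 1/x \} \ge \frac{1}{x} - 1,
\qquad \text{so} \qquad
\int g \, d\mu \ge \int_0^1 \Big( \frac{1}{x} - 1 \Big)\, dx = \infty .
```

So no integrable dominating function exists, and the sequence is not monotone. The bounded convergence theorem does not apply either, since $\sup_x f_n(x) = n$ is not uniformly bounded.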


We'll now turn to an application of these results in probability: suppose we're working on a probability space
$(\Omega, \mathcal{F}, \mathbb{P})$, and we have a random variable $Y$ with mean $\mathbb{E}[Y] = \int_\Omega Y\, d\mathbb{P}$.
Additionally, suppose that $Y$ can be written as $f(X)$, where $X$ is another random variable. In other words, we can
describe the random variables with the following diagram:

$$(\Omega, \mathcal{F}, \mathbb{P}) \xrightarrow{\ X\ } (S, \mathcal{G}, \mu) \xrightarrow{\ f\ } (\mathbb{R}, \mathcal{B}, \nu), \qquad Y = f \circ X.$$

Since $\mu$ is the pushforward measure under $X$ of $\mathbb{P}$, meaning that $\mu = X_\#\mathbb{P} = \mathbb{P} \circ X^{-1}$, and
$\nu$ is the pushforward of $\mu$ under $f$ and also of $\mathbb{P}$ under $Y$, meaning that $\nu = f_\#\mu = Y_\#\mathbb{P}$, it's
natural to ask if we can compute the expected value of $Y$ in both ways, giving us the change of variables formula

$$\int_\Omega Y\, d\mathbb{P} = \int_S f\, d\mu.$$

Such a result would be useful if $X$ is a real-valued random variable, because $S$ would be $\mathbb{R}$ and the integral
$\int_S f\, d\mu$ could be computed more explicitly. We'll start with a simple case to see that this does make sense:
Example 52 (Change of variables formula on a finite state space)

Suppose $\Omega$ is finite, so we can describe $P$ with a function $p$ satisfying $\sum_{\omega \in \Omega} p(\omega) = 1$. Any function $Y$ is then a simple function, since $Y = \sum_{\omega \in \Omega} Y(\omega) 1_{\{\omega\}} = \sum_{\omega \in \Omega} f(X(\omega)) 1_{\{\omega\}}$, so the expectation of $Y$ is (by definition)

\[ E[Y] = \int_\Omega Y\,dP = \sum_{\omega \in \Omega} f(X(\omega))\,p(\omega). \]

But $f$ is also a simple function (because it takes on only finitely many values, which are the values of $Y(\omega)$), so (here we only care about the values of $f$ on the finitely many points $X(\Omega)$)

\[ f = \sum_{x \in X(\Omega)} f(x) 1_{\{x\}} \implies \int_S f\,d\mu = \sum_{x \in X(\Omega)} f(x)\,\mu(\{x\}). \]

And now for any $x \in X(\Omega)$, $\mu(\{x\}) = (X_\# P)(\{x\}) = P(X^{-1}(\{x\}))$, which is the measure of the set of $\omega$s sent to $x$ under $X$ (and thus is the sum of the corresponding $p(\omega)$s). So the two sums both add up $f(X(\omega))p(\omega)$ over all $\omega \in \Omega$, proving that the two integrals are equal.
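The finite computation above can be checked mechanically. The following is a minimal sketch in which the space, the weights `p`, and the maps `X` and `f` are all hypothetical choices made only for illustration: it computes $E[f(X)]$ once by summing over $\Omega$ and once by summing over $S$ against the pushforward measure.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite probability space: Omega = {0, ..., 5}, uniform weights.
p = {w: Fraction(1, 6) for w in range(6)}

def X(w):
    """Random variable X : Omega -> S = {0, 1, 2}."""
    return w % 3

def f(x):
    """Function f : S -> R, so that Y = f(X)."""
    return x * x

# Left side: integrate Y = f(X) against P on Omega.
lhs = sum(f(X(w)) * p[w] for w in p)

# Pushforward mu = P o X^{-1}: mu({x}) is the total weight of X^{-1}({x}).
mu = defaultdict(Fraction)
for w, weight in p.items():
    mu[X(w)] += weight

# Right side: integrate f against mu on S.
rhs = sum(f(x) * m for x, m in mu.items())

print(lhs, rhs)
```

Using exact `Fraction` arithmetic avoids any floating-point noise, so the two sides can be compared with `==`.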


However, it turns out that the more general formula does require more work and is not just symbol pushing — we'll 


see our convergence theorems in action: 


Theorem 53 (Change of variables formula) 
Suppose we have maps $(\Omega, \mathcal{F}, P) \xrightarrow{X} (S, \mathcal{G}, \mu) \xrightarrow{f} (\mathbb{R}, \mathcal{B}, \nu)$, and let $Y = f(X)$. If either $f \ge 0$ ($f$ is nonnegative) or $\int |f(X)|\,dP < \infty$ ($Y$ is integrable), then

\[ E[Y] = \int_\Omega f(X(\omega))\,dP(\omega) = \int_S f\,d\mu = \int_S f(x)\,d(P \circ X^{-1})(x). \]

(Here, the first and last equalities are definitions of $E[Y]$ and $\int_S f\,d\mu$, but including them explains the name “change of variables”: the first and last integrals change the variable of integration from $\omega$ to $x$.)


Proof. This is a pretty standard proof method, so we should make a note of it: whenever we want to show a result about Lebesgue integrals, first prove it for indicators, and then build it up for more and more general functions.

• First, assume $f = 1_B$ for some set $B \in \mathcal{G}$. Since $f$ is simple on $S$,
\[ \int_S f\,d\mu = \mu(B) = P(X^{-1}(B)). \]
But then $Y = f(X) = 1_B(X)$ is a simple function on $\Omega$, specifically $1_{X^{-1}(B)}$, and thus $\int_\Omega Y\,dP = P(X^{-1}(B))$ as well. Thus the formula holds when $f$ is an indicator function.

• Since simple functions are finite linear combinations of indicator functions, it follows by linearity of the Lebesgue integral that $\int_S f\,d\mu = \int_\Omega f(X)\,dP$ for all simple functions.

• Next, we'll prove the formula for general nonnegative functions $f \ge 0$ (this is one of the conditions in the theorem statement). The useful trick here is to define the functions $f_n = \min\{\lfloor 2^n f \rfloor / 2^n,\, n\}$. (This wouldn't work with $n$ in place of $2^n$, because we want these $f_n$s to be nondecreasing.) Each $f_n$ is a simple function, and $f_n \uparrow f$ pointwise. From the previous step, we know that $\int f_n\,d\mu = \int f_n(X)\,dP$ for all $n$. By the monotone convergence theorem, the left side converges to $\int f\,d\mu$, and (because $f_n(X) \uparrow f(X)$ pointwise) the right side converges to $\int f(X)\,dP = \int Y\,dP$, as desired.

• Finally, for a general measurable function $f$, we prove the result by splitting $f$ into its positive and negative parts and applying the previous step to each. Subtracting the two results only makes sense if at least one of them is finite, which is why this case requires $Y$ to be integrable.
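The approximation step for nonnegative functions can be made concrete. The sketch below assumes the standard dyadic truncation $f_n = \min\{\lfloor 2^n f \rfloor / 2^n, n\}$ (our reading of the construction in the proof) and evaluates it at a single hypothetical point where $f(x) = \pi$, to show that the approximations are nondecreasing and converge from below.

```python
import math

def dyadic_approx(value, n):
    """n-th approximation min(floor(2**n * value) / 2**n, n) of a value >= 0."""
    return min(math.floor(2**n * value) / 2**n, n)

value = math.pi  # stands in for f(x) at one fixed point x (illustrative choice)
approximations = [dyadic_approx(value, n) for n in range(1, 12)]

print(approximations)
```

For small $n$ the cap at $n$ is active; once $n$ exceeds $f(x)$, the error is at most $2^{-n}$. Replacing $2^n$ by $n$ would break monotonicity, since $\lfloor n f \rfloor / n$ can decrease as $n$ grows.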


As we mentioned, this formula is most commonly used when we're trying to compute a Lebesgue integral. If we want to find $E[X]$ for some random variable $X$ on a complicated space $(\Omega, \mathcal{F}, P)$, we can set up the change of variables formula as

\[ (\Omega, \mathcal{F}, P) \xrightarrow{\;X\;} (\mathbb{R}, \mathcal{B}, \mathcal{L}_X) \xrightarrow{\;\mathrm{id}\;} (\mathbb{R}, \mathcal{B}, \mathcal{L}_X), \]

where this time we have $Y = X$. Theorem 53 then tells us that

\[ E[X] = \int_{\mathbb{R}} \mathrm{id}(x)\,d\mathcal{L}_X(x) = \int_{\mathbb{R}} x\,d\mathcal{L}_X(x), \]

which is what we expect: the expected value of $X$ only depends on its properties on the real line, and we compute it as an integral of $x$ with respect to the law of $X$.


6 September 23, 2019 


This is a final reminder that our first homework assignment (which should be printed and submitted in class) is due on 
Wednesday. Two TAs will be reading and grading, so if we are handwriting our solutions, we should make sure to be 
neat. (Writing more is not generally helpful, and solutions should aim to be “correct in some minimal way.”) Also,
because of a seminar this week, office hours have been changed to Monday 2-3 and Tuesday 11-12. 

So far in this class, we've been considering measures on the one-dimensional space $\mathbb{R}$. Today, we'll be generalizing to higher dimensions, and to do so, we'll actually need to study our construction on $\mathbb{R}$ in more detail. Recall that we started with the collection $\mathcal{I}$ of half-open intervals equipped with a Lebesgue-Stieltjes function $F$, and we made the definitions

\[ \mathcal{I} = \{(a, b]\}, \qquad \mu((a, b]) = F(b) - F(a). \]

To get to a measure on $\mathcal{B}_{\mathbb{R}}$, we first extended $\mu$ to $\mathcal{A}$, which contained finite disjoint unions of elements of $\mathcal{I}$, and we had to check that $\mu$ was indeed countably additive on $\mathcal{A}$. Once we did this, we used the Carathéodory extension theorem to obtain a measure on $\sigma(\mathcal{A})$. This second step holds in an abstract setting, as long as $\mathcal{A}$ is an algebra (meaning it is closed under complementation and finite union) and $\mu$ is countably additive on $\mathcal{A}$. However, the first step (where we extend from $\mathcal{I}$ to $\mathcal{A}$) used some topological properties of $\mathbb{R}$. So if we want to do this step in general, we need to explain what properties of $\mathcal{I}$ are actually required:


Definition 54 


A semialgebra $\mathcal{S}$ is a set of subsets of $\Omega$ closed under finite intersection and semiclosed under complementation (meaning that if $E \in \mathcal{S}$, then $\Omega \setminus E$ is a finite disjoint union of elements of $\mathcal{S}$).

For example, the set of half-open intervals $\mathcal{I}$ is a semialgebra because
\[ \mathbb{R} \setminus (a, b] = (-\infty, a] \cup (b, \infty), \]
and the two terms on the right-hand side are both in $\mathcal{I}$. (If we're concerned about the open upper bracket on $(b, \infty)$, recall that we actually defined $\mathcal{I}$ to be elements of the form $(a, b] \cap \mathbb{R}$.) And once we have our semialgebra $\mathcal{S}$, our goal is to extend $\mu$ from $\mathcal{S}$ to the set $\mathcal{A}$ of finite disjoint unions of elements in $\mathcal{S}$ (in other words, the algebra generated by $\mathcal{S}$). We can verify the following fact:


Lemma 55 


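Let $\mathcal{S}$ be a semialgebra over $\Omega$, and let $\mu : \mathcal{S} \to [0, \infty]$ be countably additive over $\mathcal{S}$ (so for any $E_i, E \in \mathcal{S}$ such that $E = \bigsqcup_{i=1}^\infty E_i$, we have $\mu(E) = \sum_{i=1}^\infty \mu(E_i)$). Then there is a unique $\bar\mu : \mathcal{A} \to [0, \infty]$ which extends $\mu$ on $\mathcal{S}$ and is countably additive over $\mathcal{A}$.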


(This was essentially proved during lectures 2 and 3, and checking that the proof carries over is a bit of bookkeeping. The details will be on our homework.) We'll now apply this result to product measures. The idea is that if we have (for instance) a Lebesgue measure defined on two different axes, then the measure of a rectangle should just be the product of the Lebesgue measures of its two sides. To formalize that, for the rest of this lecture, consider two measurable spaces $(S, \mathcal{G})$ and $(T, \mathcal{H})$. The product space will just be $\Omega = S \times T$, but it's a bit harder to write down what the measurable subsets of $\Omega$ need to look like:


Definition 56 
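The product $\sigma$-field $\mathcal{F} = \mathcal{G} \otimes \mathcal{H}$ of $(S, \mathcal{G})$ and $(T, \mathcal{H})$ is the $\sigma$-field $\sigma(\mathcal{S})$, where $\mathcal{S}$ is the set of “rectangles”

\[ \mathcal{S} = \{A \times B : A \in \mathcal{G},\ B \in \mathcal{H}\}. \]

In particular, Lemma 55 will be useful now if we verify that $\mathcal{S}$ is a semialgebra. We should draw some pictures to convince ourselves of this: the intersection of two rectangles is a rectangle, and we can decompose $\Omega \setminus (A \times B)$ into rectangles in a picture too. But if we want to be more formal (for example, if we were writing this on a test), we would write that
\[ (A \times B) \cap (C \times D) = (A \cap C) \times (B \cap D) \]
(and indeed $A \cap C \in \mathcal{G}$ and $B \cap D \in \mathcal{H}$, so the right side is in $\mathcal{S}$), and similarly
\[ \Omega \setminus (A \times B) = (A \times (T \setminus B)) \sqcup ((S \setminus A) \times T). \]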
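The two set identities for rectangles can be sanity-checked on small finite sets. This is only a sketch on hypothetical sets `S` and `T` (it verifies one instance, not the general statement): it confirms that the intersection of two rectangles is a rectangle and that the complement of a rectangle splits into two disjoint rectangles.

```python
from itertools import product

S = frozenset({0, 1, 2})
T = frozenset({"a", "b"})
Omega = frozenset(product(S, T))

def rect(A, B):
    """The rectangle A x B, represented as a set of pairs."""
    return frozenset(product(A, B))

A, B = frozenset({0, 1}), frozenset({"a"})
C, D = frozenset({1, 2}), frozenset({"a", "b"})

# The intersection of two rectangles is again a rectangle.
intersection_ok = rect(A, B) & rect(C, D) == rect(A & C, B & D)

# The complement of a rectangle is a disjoint union of two rectangles.
piece1, piece2 = rect(A, T - B), rect(S - A, T)
complement_ok = piece1.isdisjoint(piece2) and Omega - rect(A, B) == piece1 | piece2

print(intersection_ok, complement_ok)
```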


So the (product) measurable space that we'll be trying to work with here is $(\Omega, \mathcal{F}) = (S \times T, \mathcal{G} \otimes \mathcal{H})$, and all that's left is for us to actually put a measure on it. Suppose that our original measurable spaces $(S, \mathcal{G}, \lambda)$ and $(T, \mathcal{H}, \rho)$ are equipped with measures, and assume here that we have finite measures $\lambda(S), \rho(T) < \infty$ (though $\sigma$-finite is sufficient). Our goal is to define a product measure $\mu = \lambda \otimes \rho$ on $(\Omega, \mathcal{F})$ using Lemma 55, which motivates the following verification:


Lemma 57 
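Given the measures $\lambda$ and $\rho$ above, define $\mu : \mathcal{S} \to [0, \infty)$ by setting $\mu(A \times B) = \lambda(A) \cdot \rho(B)$ for all $A \in \mathcal{G}$ and $B \in \mathcal{H}$. Then $\mu$ is countably additive over $\mathcal{S}$: that is, if we can write a rectangle $A \times B$ as a disjoint countable union of rectangles $A \times B = \bigsqcup_{i=1}^\infty (A_i \times B_i)$, then $\lambda(A)\rho(B) = \sum_{i=1}^\infty \lambda(A_i)\rho(B_i)$.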


Proof. Assume without loss of generality that all sets are nonempty. (Otherwise, discard them.) Consider the coordinate projection map $\pi$ defined by $\pi(x, y) = x$. If we apply $\pi$ to both sides of $A \times B = \bigsqcup_{i=1}^\infty (A_i \times B_i)$, we get

\[ A = \pi(A \times B) = \pi\Big(\bigsqcup_{i=1}^\infty (A_i \times B_i)\Big) = \bigcup_{i=1}^\infty A_i \]

because all sets are nonempty (though we no longer need to have a disjoint union on the right). Thus, $A$ is the union of the $A_i$s, and similarly $B$ is the union of the $B_i$s. Now by definition, $(x, y)$ belongs to $A \times B$ if and only if there exists some unique index $i$ such that $(x, y) \in A_i \times B_i$. Thus, for any $x \in A$, we can look at the cross-section $\{x\} \times B$ along $B$ and break it up into disjoint pieces, meaning that
\[ B = \bigsqcup_{i \,:\, x \in A_i} B_i. \]
Since we know that $\rho$ is a measure on $T$, countable additivity then tells us that
\[ \rho(B) = \sum_{i \,:\, x \in A_i} \rho(B_i) \]
for any $x \in A$. If we then define the function $f(x)$ to be $\rho(B)$ inside $A$ and 0 otherwise (this is where we use our “area” intuition), we have

\[ f(x) = 1\{x \in A\} \cdot \rho(B) = 1\{x \in A\} \sum_{i \,:\, x \in A_i} \rho(B_i) = \sum_{i=1}^\infty 1\{x \in A_i\}\,\rho(B_i). \]

Since $f$ is a number $\rho(B)$ times an indicator function $1_A$, $f$ is a simple function on $S$. Similarly, the right-hand side is a pointwise limit of simple functions (taking the partial sums). Thus we can integrate both sides over $S$ (with respect to $\lambda$) to get
\[ \lambda(A)\rho(B) = \int f(x)\,\lambda(dx) = \int \Big[\sum_{i=1}^\infty 1\{x \in A_i\}\,\rho(B_i)\Big] \lambda(dx). \]

Since all functions are nonnegative, the partial sums are monotone, and we can apply the monotone convergence theorem, which tells us that the right-hand side is
\[ \int \Big[\sum_{i=1}^\infty 1\{x \in A_i\}\,\rho(B_i)\Big] \lambda(dx) = \sum_{i=1}^\infty \int \big[1\{x \in A_i\}\,\rho(B_i)\big] \lambda(dx) = \sum_{i=1}^\infty \lambda(A_i)\rho(B_i), \]
as desired. (So in other words, “we can interchange the sum and the integral” in this case.)


With this, we can apply Lemma 55: we've found a unique $\bar\mu = \lambda \otimes \rho$ that extends our definition on $\mathcal{S}$ and is countably additive over $\mathcal{A}$, so then applying the Carathéodory extension theorem gives us a measure $\mu$ on the product space $(\Omega, \mathcal{F})$. It is also natural to generalize to a product measure of the form $\mu_1 \otimes \mu_2 \otimes \cdots \otimes \mu_n$: we can check that the order of operations doesn't matter, so such a product measure can also be defined. (And beyond that, it's also very common to work with infinite product spaces and define $\bigotimes_{i=1}^\infty (\Omega_i, \mathcal{F}_i, \mu_i)$, but these are more tricky and we'll talk about them next lecture.)
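As a finite sanity check of the additivity of $\mu(A \times B) = \lambda(A)\rho(B)$ over rectangles, the sketch below uses hypothetical weights `lam` and `rho` on small finite sets and one hand-picked decomposition of a rectangle into disjoint rectangles. It only illustrates the statement in a finite case and is not a substitute for the proof.

```python
from fractions import Fraction
from itertools import product

lam = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}  # measure on S
rho = {"a": Fraction(1, 4), "b": Fraction(3, 4)}                 # measure on T

def measure(weights, subset):
    """Measure of a finite subset under point weights."""
    return sum((weights[x] for x in subset), Fraction(0))

A, B = {0, 1, 2}, {"a", "b"}
# Hand-picked decomposition of A x B into three rectangles.
pieces = [({0}, {"a", "b"}), ({1, 2}, {"a"}), ({1, 2}, {"b"})]

# Confirm the pieces are pairwise disjoint and cover A x B exactly.
cells = [pair for Ai, Bi in pieces for pair in product(Ai, Bi)]
is_partition = len(cells) == len(set(cells)) and set(cells) == set(product(A, B))

whole = measure(lam, A) * measure(rho, B)
parts = sum(measure(lam, Ai) * measure(rho, Bi) for Ai, Bi in pieces)

print(is_partition, whole, parts)
```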

For now, we'll turn to discussing the generalization of the Lebesgue integral to product spaces. Recall that we're not just working with Riemann double integrals: because the Lebesgue integral can also deal with summations, we're also in the setting of double sums of the form $\sum_{i,j} a_{ij}$. There is some trickiness here, because while we can always swap the order of summation for finite sums, like

\[ \sum_{i=1}^m \sum_{j=1}^n a_{ij} = \sum_{j=1}^n \sum_{i=1}^m a_{ij}, \]

we can run into issues with infinite sums:
Example 58

Define a doubly-indexed sequence (for $i, j \ge 1$)
\[ a_{ij} = \begin{cases} 1 & i = j, \\ -1 & i = j + 1, \\ 0 & \text{otherwise.} \end{cases} \]

Then the order of summation changes the final answer $\sum_{i,j} a_{ij}$: we have
\[ \sum_{i=1}^\infty \sum_{j=1}^\infty a_{ij} = 1 \ne 0 = \sum_{j=1}^\infty \sum_{i=1}^\infty a_{ij}. \]

In cases like in this example, we'll say that the double sum $\sum_{i,j} a_{ij}$ is undefined (and we can't freely change the order). But there are conditions where swapping the order of summation (or more generally, integration) is allowed, and those are also the situations where we can define the Lebesgue double integral:


Theorem 59 (Fubini-Tonelli)
Let $(S, \mathcal{G}, \lambda)$ and $(T, \mathcal{H}, \rho)$ be finite measure spaces, and define $(\Omega, \mathcal{F}, \mu) = (S \times T, \mathcal{G} \otimes \mathcal{H}, \lambda \otimes \rho)$. Let $f$ be a measurable function defined on $(\Omega, \mathcal{F})$. If $f \ge 0$ or $\int |f|\,d\mu < \infty$, then

\[ \int_S \Big[\int_T f(x, y)\,d\rho(y)\Big] d\lambda(x) = \int_\Omega f\,d\mu = \int_T \Big[\int_S f(x, y)\,d\lambda(x)\Big] d\rho(y). \]

(More specifically, Tonelli's theorem covers the $f \ge 0$ case, and Fubini's theorem covers the $\int |f|\,d\mu < \infty$ case.)

In particular, the middle integral is defined by the construction of the Lebesgue integral, but it's not clear a priori that the left and right double integrals actually exist.
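To see why some hypothesis like $f \ge 0$ or integrability is needed, we can return to double sums, which are integrals against counting measure. The sketch below assumes the doubly-indexed sequence $a_{ij} = 1$ if $i = j$, $-1$ if $i = j + 1$, and $0$ otherwise (with $i, j \ge 1$); this specific indexing is our assumption. It compares the two iterated sums and the sum of absolute values.

```python
def a(i, j):
    """Assumed sequence: 1 on the diagonal, -1 when i = j + 1, and 0 elsewhere."""
    if i == j:
        return 1
    if i == j + 1:
        return -1
    return 0

N = 50  # number of outer terms to add up

# Each row and column has at most two nonzero entries, all with index <= N + 1,
# so truncating the inner sum at N + 1 gives the exact (infinite) inner sum.
rows_first = sum(sum(a(i, j) for j in range(1, N + 2)) for i in range(1, N + 1))
cols_first = sum(sum(a(i, j) for i in range(1, N + 2)) for j in range(1, N + 1))

# The sum of |a_ij| over an N x N block grows without bound as N increases.
abs_total = sum(abs(a(i, j)) for i in range(1, N + 1) for j in range(1, N + 1))

print(rows_first, cols_first, abs_total)
```

The iterated sums stay at different values for every `N`, while `abs_total` grows linearly in `N`, so neither hypothesis of the theorem is satisfied for this sequence.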


Proof. We'll prove the first equality (the other one follows by symmetry), and we'll do so with the “start with indicators” method. First, suppose that $f = 1_E$ for a measurable $E \in \mathcal{F} = \mathcal{G} \otimes \mathcal{H}$. Fix some $x \in S$, and define $f_x(y) = f(x, y)$ to emphasize that $f_x$ is just a function of $y$ for now. Then

\[ (1_E)_x(y) = 1\{(x, y) \in E\} = 1\{y \in E_x\}, \]
where $E_x = \{y \in T : (x, y) \in E\}$ is the cross-section of $E$ at $x$, which is a subset of $T$.


Lemma 60 


The cross-section $E_x$, as defined above, is a measurable set (in $T$) for any $x \in S$.

Proof of lemma. A slick trick here is to define
\[ \mathcal{F}_x = \{E \in \mathcal{F} : E_x \text{ measurable}\} \subseteq \mathcal{F}. \]

Our goal is to show that $\mathcal{F}_x = \mathcal{F}$. We can check that $\mathcal{F}_x$ is a $\sigma$-field:

• If $E \in \mathcal{F}_x$, then $E_x$ is measurable in $T$, and thus (because $\mathcal{H}$ is a $\sigma$-field) so is its complement $T \setminus E_x = (\Omega \setminus E)_x$ (the complement of the cross-section of $E$ is the cross-section of the complement of $E$). Thus $\Omega \setminus E \in \mathcal{F}_x$ as well.

• If $E_i \in \mathcal{F}_x$ for all $i$, then each $(E_i)_x$ is measurable in $T$, so the countable union $\bigcup_{i=1}^\infty (E_i)_x = \big(\bigcup_{i=1}^\infty E_i\big)_x$ is measurable by the same logic (here we use that the countable union of the cross-sections of the $E_i$s is the cross-section of their countable union), so $\bigcup_{i=1}^\infty E_i$ is also in $\mathcal{F}_x$.

Additionally, the cross-section of any rectangle $A \times B$ is always $B$ or the empty set (depending on whether $x$ is in $A$ or not), which are both measurable, so $\mathcal{F}_x$ contains $\mathcal{S}$. But this means that $\sigma(\mathcal{F}_x) = \mathcal{F}_x$ contains $\sigma(\mathcal{S}) = \mathcal{F}$ (the whole $\sigma$-field), so $\mathcal{F}_x = \mathcal{F}$, as desired.


Thus $f_x(y) = 1\{y \in E_x\}$ is a measurable function of $y$ on $(T, \mathcal{H})$, meaning that the inner integral $\int_T f(x, y)\,d\rho(y)$ is indeed well-defined and satisfies
\[ \int_T f(x, y)\,d\rho(y) = \int_T 1_E(x, y)\,d\rho(y) = \int_T 1\{y \in E_x\}\,d\rho(y) = \rho(E_x). \]


We now need to evaluate the outer integral, and to do so, we first need to show that the map $x \mapsto \rho(E_x)$ is a measurable function on $(S, \mathcal{G})$. (The picture we should have in mind is that we slide $x$ along the $S$-axis, outputting the measure of its cross-section in $E$, and we want to verify that this is a measurable function.) To do this, we use a similar trick as before, defining
\[ \mathcal{F}_\rho = \Big\{E \in \mathcal{F} : x \mapsto \rho(E_x) \text{ is measurable on } (S, \mathcal{G}) \text{ and } \int_S \rho(E_x)\,d\lambda(x) = \mu(E)\Big\} \subseteq \mathcal{F}. \]

(This final condition is included because it is the eventual result we want to show, which is that this double integral is indeed equal to $\int_\Omega 1_E\,d\mu = \int_\Omega f\,d\mu$.) We want to show that $\mathcal{F}_\rho = \mathcal{F}$, and we'll do so with a $\pi$-$\lambda$ argument. Like before, $\mathcal{F}_\rho$ contains all rectangles in $\mathcal{S}$, because for any $E = A \times B$, $\rho(E_x)$ is equal to $\rho(B)$ if $x \in A$ and 0 otherwise, so $\rho(E_x) = \rho(B) \cdot 1\{x \in A\}$, which is indeed a measurable function integrating to $\int \rho(E_x)\,d\lambda(x) = \rho(B)\lambda(A) = \mu(E)$.

Also, $\mathcal{S}$ is a $\pi$-system (semialgebras are closed under intersection by definition), and $\mathcal{F}_\rho$ is a $\lambda$-system:


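• The whole space $\Omega = S \times T$ is an element of $\mathcal{F}_\rho$, because it is a rectangle.

• For any two sets $E_1, E_2 \in \mathcal{F}_\rho$ with $E_1 \supseteq E_2$, notice that

\[ \rho((E_1 \setminus E_2)_x) = \rho((E_1)_x \setminus (E_2)_x) = \rho((E_1)_x) - \rho((E_2)_x), \]

because the cross-section of the difference of $E_1$ and $E_2$ is the difference of the cross-sections and because $(E_1)_x \supseteq (E_2)_x$ (the subtraction is valid because $\rho$ is a finite measure). Since both terms are measurable functions of $x$ and the difference of measurable functions is measurable, $\rho((E_1 \setminus E_2)_x)$ is indeed a measurable function. Additionally, the integral of $\rho((E_1 \setminus E_2)_x)$ is $\mu(E_1) - \mu(E_2) = \mu(E_1 \setminus E_2)$ by linearity of the integral. Thus $E_1 \setminus E_2 \in \mathcal{F}_\rho$ as well.

• Finally, we want to show that if $E_i \in \mathcal{F}_\rho$ for all $i$ and $E_i \uparrow E$, then the limit $E$ is also in $\mathcal{F}_\rho$. Since $(E_i)_x \uparrow E_x$, $\rho((E_i)_x) \uparrow \rho(E_x)$ for all $x$ by continuity from below. So $x \mapsto \rho(E_x)$ is measurable (as the pointwise limit of the measurable functions $x \mapsto \rho((E_i)_x)$), and $\int \rho(E_x)\,d\lambda(x)$ is the limit of the $\int \rho((E_i)_x)\,d\lambda(x)$s by the monotone convergence theorem, which is indeed $\lim_{i \to \infty} \mu(E_i) = \mu(E)$. Thus $E \in \mathcal{F}_\rho$.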


So now by the $\pi$-$\lambda$ theorem, because $\mathcal{F}_\rho$ is a $\lambda$-system containing the $\pi$-system $\mathcal{S}$, it also contains $\sigma(\mathcal{S})$. Thus $\mathcal{F}_\rho = \mathcal{F}$, meaning that for any $E \in \mathcal{F} = \mathcal{G} \otimes \mathcal{H}$, the indicator function $1_E$ satisfies Fubini-Tonelli. From here, we perform the usual machinery: the result is also true for simple functions by linearity, so it's also true for nonnegative functions by approximating from below with the monotone convergence theorem, and thus it's also true for all integrable functions by splitting into the positive and negative parts.
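The indicator case of the argument can be checked by hand on a finite product space. In the sketch below, the weights `lam` and `rho` and the non-rectangular set `E` are hypothetical choices: we compute $\mu(E)$ directly and compare it with the two iterated sums of cross-section measures.

```python
from fractions import Fraction

lam = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}  # measure on S
rho = {0: Fraction(1, 3), 1: Fraction(2, 3)}                     # measure on T

# A non-rectangular set E inside S x T.
E = {(0, 0), (1, 0), (1, 1), (2, 1)}

# mu(E) computed directly from the product weights.
mu_E = sum(lam[x] * rho[y] for (x, y) in E)

# Integrate the cross-section measure rho(E_x) over x ...
x_first = sum(lam[x] * sum(rho[y] for y in rho if (x, y) in E) for x in lam)
# ... and the other cross-section measure over y.
y_first = sum(rho[y] * sum(lam[x] for x in lam if (x, y) in E) for y in rho)

print(mu_E, x_first, y_first)
```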
7 September 25, 2019


Last time, we constructed the product measure on the product of two $\sigma$-finite measure spaces. Specifically, given $(S, \mathcal{G}, \lambda)$ and $(T, \mathcal{H}, \rho)$, we explained how to define $\Omega = S \times T$, $\mathcal{F} = \mathcal{G} \otimes \mathcal{H}$, and $\mu = \lambda \otimes \rho$. Note that we can also define non-product measures, such as the uniform measure on a triangle in our space: some examples of these will be on our homework for next time. (And as mentioned, we can always extend this to arbitrary finite product spaces $\big(\prod_{i=1}^n \Omega_i, \bigotimes_{i=1}^n \mathcal{F}_i, \bigotimes_{i=1}^n \mu_i\big)$ by induction.)

We also mentioned during our discussion that infinite products are more subtle, and that's what we'll be going into today. For example, consider the percolation model on the infinite integer lattice grid, where at every point $(x, y) \in \mathbb{Z}^2$, we flip a coin to decide whether we have a 0 or a 1. The percolation problem then asks us about the large-scale structure of this model. And we can also build more complicated models off of this (for example in statistical physics), where perhaps we have some nearest-neighbor interaction between the sites. Our first instinct might be to avoid infinite products completely and only consider a finite large subset of the lattice, but often the boundary conditions of such a system are annoying to deal with. So here are the two main results we're going to prove today:

• a special case of the Ionescu-Tulcea theorem, which says that we can define an infinite product of probability spaces with no further conditions, and

• the Kolmogorov extension theorem, which says that we can define a non-product measure on an infinite product space, with a mild regularity condition.


We're using a product sigma-algebra in both cases, so we'll start by defining what that means for an infinite product: 


Definition 61 


Let $(\Omega_\alpha, \mathcal{F}_\alpha)$ be measurable spaces, indexed by $\alpha \in I$ (so there can be an uncountable set of $\alpha$s). The product $\sigma$-field $\mathcal{F} = \bigotimes_{\alpha \in I} \mathcal{F}_\alpha$ is the minimal $\sigma$-field over $\Omega = \prod_{\alpha \in I} \Omega_\alpha$ such that all single-coordinate projections $\pi_\alpha$ are measurable. In other words, we require that $(\pi_\alpha)^{-1}(E_\alpha) \in \mathcal{F}$ for all $E_\alpha \in \mathcal{F}_\alpha$.

(This definition may look familiar if we've heard of a product topology.) The preimage of $E_\alpha$ can also be written
\[ (\pi_\alpha)^{-1}(E_\alpha) = E_\alpha \times \prod_{\gamma \in I \setminus \{\alpha\}} \Omega_\gamma, \]
and because $I$ can be uncountable, this is not the same as just taking arbitrary Cartesian products of measurable sets $E_\gamma$ in each $\mathcal{F}_\gamma$. Specifically, we are only allowed to take countable unions (and intersections) of these sets, so at most countably many of the sets in our product are allowed to be something other than the whole set $\Omega_\gamma$.


Definition 62 (Notation) 


For any subset $J \subseteq I$ of the index set, we will define the partial products

\[ \Omega_J = \prod_{\alpha \in J} \Omega_\alpha, \qquad \mathcal{F}_J = \bigotimes_{\alpha \in J} \mathcal{F}_\alpha. \]

To understand why this is a useful definition, we'll bring in the probability intuition: we often care about the finite-dimensional distributions of a probability measure, also called its marginals. Specifically, suppose we have a product measure on an infinite product probability space $(\Omega, \mathcal{F}, P)$, where the index set $I$ is countable. Then any set $E = \prod_{\alpha \in I} E_\alpha$ (with each $E_\alpha \in \mathcal{F}_\alpha$) is measurable, so we would guess that

\[ P(E) \overset{?}{=} \prod_{\alpha \in I} P_\alpha(E_\alpha). \]
(This turns out to not quite be true, but we're just talking about intuition for now.) In order for this probability to be positive (nonzero), most of the probabilities $P_\alpha(E_\alpha)$ need to be pretty close to 1, meaning that most of the dimensions basically need to include the whole space. That motivates only looking at distributions along finitely many dimensions:


Definition 63 
Let $J$ be a finite subset of our index set $I$, and let $P$ be a measure on the infinite product space $\Omega$. The finite-dimensional marginal (or finite-dimensional distribution) of $P$ on $J$ is the pushforward measure $P_J = (\pi_J)_\# P$ on $(\Omega_J, \mathcal{F}_J)$, where for any $E_J \in \mathcal{F}_J$,

\[ [(\pi_J)_\#(P)](E_J) = P(\pi_J^{-1}(E_J)) = P(E_J \times \Omega_{I \setminus J}). \]

If we want these marginals to make sense, then we should get consistent distributions no matter how we “project down.” In other words, if we have two finite subsets $J' \subseteq J \subseteq I$, then the marginal $P_{J'}$ of $P$ on $J'$ should be the same whether we're projecting directly from $I$ to $J'$, or from $I$ to $J$ and then to $J'$. (The measurability condition in the definition of the product $\sigma$-field is necessary to make sure everything in this consistency statement is well-defined.)

So when we construct a measure $P$ on an infinite product space, there are already some consistency conditions that $P$ must satisfy, and a natural question is to ask whether we can just use these conditions to construct $P$. And in the results we'll show today, we'll start with the family of $P_J$s (also called a family of finite-dimensional distributions), and we'll get a measure $P$ consistent with those marginals.
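The consistency condition is easy to see in a finite toy case. The sketch below uses a hypothetical non-product distribution on $\{0,1\}^3$ (index set $I = \{0, 1, 2\}$) and a helper `marginal`, both invented for illustration: projecting directly onto $J' = \{0\}$ agrees with projecting onto $J = \{0, 1\}$ first and then onto $J'$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical non-product distribution on {0,1}^3:
# the weight of w is proportional to 1 + (number of ones in w).
raw = {w: 1 + sum(w) for w in product((0, 1), repeat=3)}
total = sum(raw.values())
P = {w: Fraction(v, total) for w, v in raw.items()}

def marginal(dist, positions):
    """Push dist forward under projection onto the given tuple positions."""
    out = defaultdict(Fraction)
    for w, weight in dist.items():
        out[tuple(w[c] for c in positions)] += weight
    return dict(out)

P_J = marginal(P, (0, 1))        # project I = {0,1,2} onto J = {0,1}
direct = marginal(P, (0,))       # project I directly onto J' = {0}
two_step = marginal(P_J, (0,))   # project I -> J -> J'

print(direct, two_step)
```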

Before we get to the main results, we'll go over a useful alternate version of the Carathéodory extension theorem 


to apply in the infinite-dimensional case (which doesn't require us to check countable additivity directly): 


Lemma 64 (Variant of Carathéodory) 
Let $\mathcal{A}$ be an algebra over $\Omega$, and let $\mu : \mathcal{A} \to [0, \infty)$ be a finite set function which is finitely additive. Suppose that for any sequence $B_n \in \mathcal{A}$ with $B_n \downarrow \emptyset$, we have $\mu(B_n) \downarrow 0$. Then $\mu$ is also countably additive over $\mathcal{A}$ (so the Carathéodory extension theorem applies).

Proof. We need to prove countable additivity: let $A_n \in \mathcal{A}$ be disjoint sets such that $A = \bigsqcup_{n=1}^\infty A_n$ is also in $\mathcal{A}$. For each positive integer $n$, the set $B_n = A \setminus \bigsqcup_{i=1}^n A_i$ is also in $\mathcal{A}$ (because it can be constructed with just finite unions and complementations of $A_i$s and $A$). By finite additivity, we thus have

\[ \mu(A) = \sum_{i=1}^n \mu(A_i) + \mu(B_n). \]

But $B_n$ decreases to the empty set, meaning that $\mu(B_n) \to 0$ by assumption. Thus taking $n \to \infty$ shows countable additivity, as desired.


We now want to specialize this lemma to our infinite product space $(\Omega, \mathcal{F}) = (\prod \Omega_\alpha, \bigotimes \mathcal{F}_\alpha)$, and we'll set up the notation now. As an exercise, we can check that

\[ \mathcal{A} = \{(\pi_J)^{-1}(E_J) = E_J \times \Omega_{I \setminus J} : J \text{ finite},\ E_J \in \mathcal{F}_J\} \]
is an algebra (we just need to check complementation and finite union, both of which are pretty easy). If we now consider an arbitrary sequence $B_n \in \mathcal{A}$ with $B_n \downarrow \emptyset$, each $B_i$ only involves finitely many coordinates, so the total number of coordinates that can be involved in any of the $B_i$s is countable. So we'll renumber the relevant coordinates to $Q = \{1, 2, 3, \ldots\}$ for convenience, and we may assume without loss of generality that $B_1$ depends on the first coordinate, $B_2$ depends on the first two, and so on (we can check that both of the following proofs work whether each coordinate in $Q$ represents a single $\Omega_\alpha$ or a product of them). This means that our sequence of $B_n$s can be written more explicitly as

\[ \boxed{B_n = \tilde B_n \times \Omega_{Q \setminus [n]} \times \Omega_{\text{rest}}} \]

where $[n] = \{1, 2, \ldots, n\}$, $\tilde B_n \in \mathcal{F}_{[n]}$ is the finite-dimensional base of $B_n$, and $\Omega_{\text{rest}}$ is the full product space on the index set $I \setminus Q$. (Here, the subset $Q$ can depend on the $B_n$ but is always countable.) Then to apply Lemma 64 in our theorems, we just need to show that $\mu(B_n) \downarrow 0$.


Theorem 65 (Ionescu-Tulcea theorem, special case)

Let $(\Omega_\alpha, \mathcal{F}_\alpha, P_\alpha)$ be probability spaces indexed by $\alpha \in I$. There is a unique probability measure $P$ on the product space $\big(\prod_{\alpha \in I} \Omega_\alpha, \bigotimes_{\alpha \in I} \mathcal{F}_\alpha\big)$ whose family of finite-dimensional distributions is given by $P_J = \bigotimes_{\alpha \in J} P_\alpha$ for any finite $J$.

This is essentially us creating a “product measure” on our infinite-dimensional space, since we are requiring product measures on all finite-dimensional $\Omega_J$s.


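Proof. As discussed above, we will be applying Lemma 64 using the algebra $\mathcal{A} = \{E_J \times \Omega_{I \setminus J} : J \text{ finite},\ E_J \in \mathcal{F}_J\}$. To have the correct finite-dimensional marginals, we should define
\[ P(E_J \times \Omega_{I \setminus J}) = \Big(\bigotimes_{\alpha \in J} P_\alpha\Big)(E_J) \]
for each $E_J \in \mathcal{F}_J$ (note that $E_J$ may not be the product of $E_\alpha$s, so we need to write the right-hand side in terms of the product measure $P_J = \bigotimes_{\alpha \in J} P_\alpha$). It's a bookkeeping exercise to show that $P$ is well-defined and finitely additive over $\mathcal{A}$, so we'll be done by Lemma 64 if we can show that $P(B_n) \downarrow 0$ for any sets $B_n = \tilde B_n \times \Omega_{Q \setminus [n]} \times \Omega_{\text{rest}}$ of the boxed form above (with bases $\tilde B_n \in \mathcal{F}_{[n]}$) that decrease to the empty set. By definition, because $B_n$ is of the form $E_J \times \Omega_{I \setminus J}$ for $J = [n]$ and $E_J = \tilde B_n$,

\[ P(B_n) = P_{[n]}(\tilde B_n) = P_{[n+1]}(\tilde B_n \times \Omega_{n+1}) \ge P_{[n+1]}(\tilde B_{n+1}) = P(B_{n+1}) \]

(where the second equality comes from consistency of the marginals, and the inequality comes from the $B_n$s being decreasing, so that $\tilde B_{n+1} \subseteq \tilde B_n \times \Omega_{n+1}$). Thus, the $P(B_n)$ form a nonincreasing sequence of numbers, and we just need to show that its limit is zero. By Fubini's theorem, we can write out our integral as

\[ P(B_n) = P_{[n]}(\tilde B_n) = \int \Big[\cdots \int 1\{\tilde B_n\}\,dP_n \cdots\Big] dP_1. \]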


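To visualize the next step, consider the following doubly infinite array (writing $\tilde B_i \in \mathcal{F}_{[i]}$ for the finite-dimensional base of $B_i$):

\[
\begin{array}{ccccc}
P(B_1) & 1\{\tilde B_1\} & 1\{\tilde B_1 \times \Omega_2\} & 1\{\tilde B_1 \times \Omega_2 \times \Omega_3\} & \cdots \\
P(B_2) & \int 1\{\tilde B_2\}\,dP_2 & 1\{\tilde B_2\} & 1\{\tilde B_2 \times \Omega_3\} & \cdots \\
P(B_3) & \iint 1\{\tilde B_3\}\,dP_3\,dP_2 & \int 1\{\tilde B_3\}\,dP_3 & 1\{\tilde B_3\} & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{array}
\]

In other words, if $a_{ij}$ is the entry in the $i$th row and $j$th column (and we start the array from the first row and zeroth column), then $a_{ii} = 1\{\tilde B_i\}$ for any $i$. Additionally, we multiply by $\Omega_{n+1}$ when we move from column $n$ to column $n+1$, and we integrate with respect to $P_{n+1}$ when we move from column $n+1$ to column $n$.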

We then notice that the $k$th column consists of functions on the first $k$ coordinates (the leftmost column is a sequence of numbers, the next column is a sequence of measurable functions on $(\Omega_1, \mathcal{F}_1)$, the following column is a sequence of measurable functions on $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \otimes \mathcal{F}_2)$, and so on). Additionally, each column is nonincreasing, so they all have (pointwise) limits. Let $g_n$ be the pointwise limit of the $n$th column, and suppose for the sake of contradiction that $g_0 > 0$. By the bounded convergence theorem (which we can use because probability spaces are finite and our functions are bounded from above by 1), we can interchange the limit and the integration with respect to $P_{n+1}$, meaning that
\[ g_n = \int g_{n+1}\,dP_{n+1} \quad \text{for all } n. \]

Now because $g_0 = \int g_1\,dP_1$ is positive, there is some $\omega_1 \in \Omega_1$ such that $g_1(\omega_1)$ is positive. But then $g_1(\omega_1) = \int g_2(\omega_1, \omega_2)\,dP_2(\omega_2)$, meaning there is some $\omega_2 \in \Omega_2$ such that $g_2(\omega_1, \omega_2)$ is positive. Repeating this, we have an infinite sequence $(\omega_1, \omega_2, \ldots)$ such that $g_n(\omega_1, \ldots, \omega_n) > 0$ for all $n$. Remembering that the columns are nonincreasing, we can only have $g_n(\omega_1, \ldots, \omega_n) > 0$ if $1\{\tilde B_n\}(\omega_1, \ldots, \omega_n) > 0$ (where $\tilde B_n \in \mathcal{F}_{[n]}$ is the base of $B_n$), meaning that $(\omega_1, \ldots, \omega_n) \in \tilde B_n$, so $(\omega_1, \omega_2, \ldots) \in \tilde B_n \times \Omega_{Q \setminus [n]}$. Putting this together across all $n$, we find that
\[ (\omega_1, \omega_2, \ldots) \in \bigcap_n \big(\tilde B_n \times \Omega_{Q \setminus [n]}\big). \]

But this is impossible: if the sets $B_n = \tilde B_n \times \Omega_{Q \setminus [n]} \times \Omega_{\text{rest}}$ decrease to the empty set, then the intersection on the right-hand side must be empty. Thus we have a contradiction, meaning that we really have $g_0 = 0$, so we can apply Lemma 64 to finish the construction of the measure.


Theorem 66 (Kolmogorov extension theorem)
Let $\Omega_\alpha$ be metric spaces with Borel $\sigma$-fields $\mathcal{F}_\alpha$. Suppose we have a (not necessarily product) consistent family of finite-dimensional distributions $\{P_J\}$, such that $P_J$ is inner-regular for all $J$ (meaning that the measure of any set can be approximated from within by compact subsets). Then there is a probability measure $P$ on the whole space $(\Omega, \mathcal{F})$ with these finite-dimensional distributions $\{P_J\}$.

Proof. Use the same $\mathcal{A}$ as in the previous proof, and again define $P : \mathcal{A} \to [0, 1]$ by setting $P(E_J \times \Omega_{I \setminus J}) = P_J(E_J)$. (Again, it is a bookkeeping exercise for us to show that $P$ is well-defined and finitely additive over $\mathcal{A}$.) Then we again need to check the condition of Lemma 64: given sets of the form $B_n = \tilde B_n \times \Omega_{Q \setminus [n]} \times \Omega_{\text{rest}}$ (with bases $\tilde B_n \in \mathcal{F}_{[n]}$) decreasing to the empty set, we wish to prove that $P(B_n) \downarrow 0$.

Suppose for the sake of contradiction that $P(B_n) \downarrow \varepsilon > 0$. The key step is to replace each base $\tilde B_n \in \mathcal{F}_{[n]}$ of $B_n$ with a compact set $\tilde K_n$ sitting inside it such that the same hypotheses still hold. By assumption, $P_{[n]}$ is in our family of finite-dimensional distributions, so it is inner-regular, meaning we can find a compact $\tilde C_n \subseteq \tilde B_n$ such that

\[ P_{[n]}(\tilde C_n) \ge P_{[n]}(\tilde B_n) - \frac{\varepsilon}{2^{n+1}}. \]

The sets $C_n = \tilde C_n \times \Omega_{I \setminus [n]}$ are now no longer necessarily decreasing, but we can define

\[ \tilde K_n = \bigcap_{\ell=1}^n \big(\tilde C_\ell \times \Omega_{[n] \setminus [\ell]}\big) \]
for each $n$. These are closed subsets of compact sets (because the $n$th term of the intersection is the compact set $\tilde C_n$) and are thus compact, and $\tilde K_{n+1}$ is contained within $\tilde K_n \times \Omega_{n+1}$. So the sets
\[ K_n = \tilde K_n \times \Omega_{I \setminus [n]} \]
are decreasing and contained in the $B_n$s, so they decrease to the empty set. (At this point, it's important to note that the $K_n$s are not necessarily compact. If they were, we'd already have a contradiction, because a nested sequence of nonempty compact sets has a nonempty intersection.) But now we can check (left as an exercise to us) that using the compact subsets does not lose us much measure, and we have
\[ P(K_n) \ge P(B_n) - \frac{\varepsilon}{2}. \]


Thus, $P(K_n) \downarrow \delta \ge \frac{\varepsilon}{2}$ decreases to some positive value. But $K_n$ is contained in $B_n$, and the latter decrease to the empty set, so $K_n \downarrow \emptyset$ as well. Now $K_n$ is nonempty for all $n$ (because it has a positive measure), so we can find some $\omega^{(n)} \in K_n$. Let $\omega^{0,n} = \omega^{(n)}$, and for every $\ell \ge 1$, consider the sequence

\[ \big(\pi_{[\ell]}(\omega^{\ell-1,n})\big)_{n \ge \ell}, \]

meaning that we only consider the first $\ell$ components of each $\omega$ in the sequence. Since this is a sequence contained in the compact set $\tilde K_\ell$ (the base of $K_\ell$ in the first $\ell$ coordinates), it has a convergent subsequence, and we let those elements form the next sequence $\omega^{\ell,n}$. (In other words, the $\{\omega^{\ell,n}\}_{n \ge 1}$ sequence converges in the first $\ell$ coordinates.)

But we can now take the diagonal entries $\omega^{n,n}$ and form a new sequence out of them: because we defined our sequence to converge in the first $\ell$ coordinates for any $\ell$, we know that $\{\pi_\ell(\omega^{n,n})\}_{n \ge 1}$ converges to some $x_\ell \in \Omega_\ell$ for all $\ell$. But this means (because each $\tilde K_n$ is closed) that there is some point $(x_1, x_2, \ldots)$ which is in all of the $K_n$s:
\[ (x_1, x_2, \ldots) \in \bigcap_{n=1}^\infty \big(\tilde K_n \times \Omega_{Q \setminus [n]}\big). \]

And much like in our previous proof, this is a contradiction: the sets $K_n = \tilde K_n \times \Omega_{Q \setminus [n]} \times \Omega_{\text{rest}}$ decrease to the empty set, so the intersection on the right-hand side must be empty. Thus $P(B_n)$ must actually decrease to 0, so we can again apply Lemma 64 and get a valid measure $P$, as desired.


8 September 30, 2019 


Last week, we did a lot of work to construct product measure spaces, especially in the case where we have an uncountable product of the form $(\Omega, \mathcal{F}, P) = \big(\prod_{\alpha \in I} \Omega_\alpha, \bigotimes_{\alpha \in I} \mathcal{F}_\alpha, \bigotimes_{\alpha \in I} P_\alpha\big)$. Today, we'll mostly talk about “easier things” motivated by undergraduate probability: we have likely seen concepts like variance, correlation, and independence before, and we'll basically review them now in a more formalized setting. Note that we're now considering arbitrary probability spaces (and are no longer assuming any kind of product structure).


Definition 67 
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$ (in other words, let $X : \Omega \to \mathbb{R}$ be a measurable function). The $\sigma$-field generated by $X$, denoted $\sigma(X)$, is the minimal $\sigma$-field over $\Omega$ such that $X$ is measurable. In other words,

\[ \sigma(X) = \{X^{-1}(B) : B \in \mathcal{B}_{\mathbb{R}}\}. \]

We can check that this is a $\sigma$-field (for example, the preimage of the complement is the complement of the preimage). Intuitively, $\sigma(X)$ tells us about the events in $\Omega$ that we can describe just based on how $X$ maps the space into $\mathbb{R}$. For example, if $X$ just sends everything to 0, we can't distinguish very much: the preimage of any Borel set $B$ is $\Omega$ if $B$ contains 0 and $\emptyset$ otherwise, so $\sigma(X) = \{\emptyset, \Omega\}$ and our random variable can't tell us very much about $\Omega$. In other words, “the larger $\sigma(X)$ is, the more helpful $X$ is for giving us information.”
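On a finite sample space, $\sigma(X)$ can be listed explicitly: its atoms are the level sets of $X$, and every event in $\sigma(X)$ is a union of atoms. The sketch below does this for a hypothetical six-point space and a hypothetical $X$ that only records parity, so that $\sigma(X)$ has just four events.

```python
from itertools import combinations

Omega = frozenset(range(6))

def X(w):
    """Hypothetical random variable: only records the parity of w."""
    return w % 2

# Level sets X^{-1}({v}) are the atoms of sigma(X).
values = sorted({X(w) for w in Omega})
atoms = [frozenset(w for w in Omega if X(w) == v) for v in values]

# Every event in sigma(X) is a union of atoms (one event per subset of values).
sigma_X = {
    frozenset().union(*chosen)
    for r in range(len(atoms) + 1)
    for chosen in combinations(atoms, r)
}

print(sorted(sorted(event) for event in sigma_X))
```

A random variable taking more distinct values would have more atoms and therefore a larger $\sigma(X)$, matching the intuition that it carries more information.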


Definition 68 


Two events $A, B \in \mathcal{F}$ are independent (denoted $A \perp\!\!\!\perp B$) if and only if $P(A \cap B) = P(A)P(B)$.

In particular, we also have that
\[ P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)P(B) = P(A)P(B^c), \]
so $A \perp\!\!\!\perp B \implies A \perp\!\!\!\perp B^c$ (and applying the same argument with $B^c$ in place of $B$ gives the converse). We can also generalize independence to more events:


Definition 69 
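The events $\{A_\alpha : \alpha \in I\}$ are mutually independent if for all finite subsets $J \subseteq I$,

\[ P\Big(\bigcap_{\alpha \in J} A_\alpha\Big) = \prod_{\alpha \in J} P(A_\alpha). \]

Extending this definition, if $\{\mathcal{C}_\alpha : \alpha \in I\}$ are each a collection of events, then the collections are independent if the above equality holds for all finite $J \subseteq I$ and for each choice of $A_\alpha \in \mathcal{C}_\alpha$ (for all corresponding $\alpha$).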


Definition 70 
A set of random variables $\{X_\alpha : \alpha \in I\}$ defined on the same probability space $(\Omega, \mathcal{F}, P)$ are mutually independent random variables if and only if their $\sigma$-algebras $\{\sigma(X_\alpha) : \alpha \in I\}$ are independent.


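In particular, for any finite set $J \subseteq I$ and any measurable subsets $B_\alpha \in \mathcal{B}_{\mathbb{R}}$, we have

$$P(X_\alpha \in B_\alpha \text{ for all } \alpha \in J) = P\left(\bigcap_{\alpha \in J} X_\alpha^{-1}(B_\alpha)\right) = \prod_{\alpha \in J} P(X_\alpha^{-1}(B_\alpha)) = \prod_{\alpha \in J} P(X_\alpha \in B_\alpha)$$

by mutual independence (because the sets $X_\alpha^{-1}(B_\alpha)$ are each in the respective $\sigma$-algebras $\sigma(X_\alpha)$). And this probabilistic statement is what we may have seen in an undergraduate probability class as the definition of independence!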


Example 71 
By definition, a collection of events $\{E_\alpha : \alpha \in I\}$ are independent if and only if the corresponding indicator variables $\{1_{E_\alpha} : \alpha \in I\}$ are independent random variables.


Example 72 
Consider the special case where we have a product probability space $(\Omega, \mathcal{F}, P) = \left(\prod_{\alpha \in I} \Omega_\alpha, \bigotimes_{\alpha \in I} \mathcal{F}_\alpha, \bigotimes_{\alpha \in I} P_\alpha\right)$ (as we've previously constructed). Then we can define random variables $X_\alpha(\omega) = f_\alpha(\omega_\alpha)$ for each $\alpha$, which are measurable functions that only depend on the $\alpha$-coordinate.

The preimage of any $B \in \mathcal{B}_{\mathbb{R}}$ is then

$$X_\alpha^{-1}(B) = f_\alpha^{-1}(B) \times \prod_{\gamma \in I \setminus \{\alpha\}} \Omega_\gamma$$

(because only the $\alpha$-coordinate matters for $X_\alpha$), so the $\sigma$-field for the random variable $X_\alpha$ is

$$\sigma(X_\alpha) = \sigma(f_\alpha) \otimes \bigotimes_{\gamma \in I \setminus \{\alpha\}} \mathcal{T}_\gamma,$$

where $\mathcal{T}_\gamma$ is the trivial $\sigma$-field $\{\emptyset, \Omega_\gamma\}$ (since we don't have any “information” about the other coordinates from $X_\alpha$, and it's okay to include $\emptyset$ in any coordinate because that just gives us the empty set overall, which is the preimage of the empty set). We can then easily see (by applying the definition and writing out the preimages) that these $\sigma$-fields $\sigma(X_\alpha)$ are independent, meaning that all of the $(X_\alpha : \alpha \in I)$ will be independent random variables on $(\Omega, \mathcal{F}, P)$.

The example above may seem like a special case for independence, but it turns out to be basically the only one! 
In general, suppose we have some probability space, and suppose that $\{X_\alpha : \alpha \in I\}$ are independent random variables on that space. Then define a new random variable $X$ which is basically a tuple of the $X_\alpha$s, setting

$$X(\omega) = (X_\alpha(\omega))_{\alpha \in I} \in \prod_{\alpha \in I} \mathbb{R}.$$

Then $X$ is a map $(\Omega, \mathcal{F}, P) \to \left(\prod_{\alpha \in I} \mathbb{R}, \bigotimes_{\alpha \in I} \mathcal{B}_{\mathbb{R}}, \mathcal{L}_X\right)$, where the law of $X$ is $\mathcal{L}_X = X_\# P = P \circ X^{-1}$. But then we can check that the $X_\alpha$s are independent random variables if and only if the law $\mathcal{L}_X$ is a product measure (by our discussion of product measure constructions from last lecture). In other words, we should treat “independence of random variables” and “law is a product measure” as one and the same.
We've now reached (basically) the end of measure theory in this class — we now have enough theory developed 


to talk about more interesting aspects of probability. 


Definition 73 
Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$. The $p$th moment of $X$ is the value of $E(|X|^p)$ (which may be infinite), and the $L^p$ norm of $X$ is $\|X\|_p = \|X\|_{L^p(\Omega, \mathcal{F}, P)} = E(|X|^p)^{1/p}$. We say that $X$ belongs to $L^p$ if $\|X\|_p$ is finite.


The following is a useful fact about integrals that we won't prove here: 


Theorem 74 (Jensen's inequality) 

Let $G \subseteq \mathbb{R}$ be an open interval, and let $g : G \to \mathbb{R}$ be a convex function, meaning that $g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y)$ for all $\lambda \in [0,1]$ and $x, y \in G$. Then if $X$ is a random variable with $P(X \in G) = 1$ and $E|X|$ and $E|g(X)|$ both finite, then $E(g(X)) \ge g(E(X))$.


In particular, this tells us a useful inequality for the $L^p$ norms of a random variable:

Corollary 75 ($L^p$ monotonicity)
Let $0 < r < p < \infty$, and let $X$ be a random variable such that $\|X\|_r, \|X\|_p$ are both finite. Then Jensen's inequality applied to the convex function $g(y) = |y|^{p/r}$ yields $E(|X|^p) = E((|X|^r)^{p/r}) \ge (E|X|^r)^{p/r}$, so $\|X\|_r \le \|X\|_p$ when $r \le p$.
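As a quick numerical sanity check of Corollary 75, the following sketch computes $\|X\|_p$ for several values of $p$ on a small finite probability space. The particular values and probabilities are arbitrary choices made for illustration; they do not come from the lecture.

```python
def lp_norm(values, probs, p):
    """Return (E|X|^p)^(1/p) for a discrete random variable X.

    `values` are the possible values of X and `probs` their probabilities.
    """
    return sum(q * abs(v) ** p for v, q in zip(values, probs)) ** (1.0 / p)


# Illustrative (made-up) distribution on three points; probabilities sum to 1.
values = [-2.0, 0.5, 3.0]
probs = [0.2, 0.5, 0.3]

exponents = (0.5, 1, 2, 4)
norms = [lp_norm(values, probs, p) for p in exponents]

for p, norm in zip(exponents, norms):
    print(f"p = {p}: ||X||_p = {norm:.4f}")
```

The printed norms should be nondecreasing in $p$, which is exactly the statement $\|X\|_r \le \|X\|_p$ for $r \le p$ on a probability space.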


Since an infinite $L^p$ norm is “larger” than any finite $L^p$ norm, we should expect that it is not possible for $\|X\|_r$ to be infinite but $\|X\|_p$ to be finite. This is indeed the case, but we have to deal with infinite expectations with a bit more care:

Proposition 76
Let $0 < r < p < \infty$. If $X$ is a random variable with $\|X\|_p < \infty$, then $\|X\|_r < \infty$; in other words, $L^r(\Omega, \mathcal{F}, P) \supseteq L^p(\Omega, \mathcal{F}, P)$ when $r < p$.


Proof. We split up the expectation into two terms

$$E(|X|^r) = E(|X|^r 1\{|X| \le 1\}) + E(|X|^r 1\{|X| > 1\}),$$

which we'll rewrite (from here on, this will be standard notation in this class) as

$$E(|X|^r) = E(|X|^r; |X| \le 1) + E(|X|^r; |X| > 1).$$

The first term is bounded by $E(1) = 1$ (because $|X|^r \le 1$ whenever $|X| \le 1$), so finiteness of the $L^r$ norm only depends on the second term. But $E(|X|^r; |X| > 1) \le E(|X|^p; |X| > 1)$ (because $|X|^p \ge |X|^r$ whenever $p > r$ and $|X| > 1$). Thus, if $\|X\|_p$ is finite, then $E(|X|^p; |X| > 1)$ is finite, so $E(|X|^r; |X| > 1)$ is finite, so $E(|X|^r)$ is finite and thus $\|X\|_r$ is finite.


However, it's important to note that this $L^p$ monotonicity only holds for probability spaces. We'll verify the following claims on our homework:

• Consider the sequence spaces $\ell^p$, which contain elements of the form $x = (x_1, x_2, \cdots)$. The $\ell^p$ norm of such a sequence is defined as

$$\|x\|_p = \left(\sum_{j=1}^{\infty} |x_j|^p\right)^{1/p}.$$

In these cases, if $r < p$, then $\|x\|_r \ge \|x\|_p$ (this inequality goes in the opposite direction from Corollary 75, and its proof does not come from Jensen's inequality), meaning that we actually have $\ell^r \subseteq \ell^p$ when $r < p$. One way to remember this is that the unit ball in $\ell^r$ is nested in the unit ball for $\ell^p$: for example, the unit ball in $\ell^1$ is the set of points such that $|x_1| + |x_2| + \cdots \le 1$ (which is a “diamond”), while the unit ball in $\ell^2$ is an actual “ball.” And as $p$ increases, this pattern continues: we can check that the $\ell^\infty$ unit ball is a cube of side length 2.

• On the other hand, consider the classical function spaces $L^p(\mathbb{R}, \mathcal{B}, \lambda)$ under the Lebesgue measure, where the norm of a function is

$$\|f\|_p = \left(\int |f(x)|^p \, dx\right)^{1/p}.$$

In this case, it turns out that the $L^p$ and $L^r$ function spaces are not nested in either direction.

If we now turn back to our random variables $X$ on $(\Omega, \mathcal{F}, P)$, Proposition 76 tells us that

$$E(|X|^2) < \infty \implies E|X| < \infty,$$

and when these moments are finite, we have $E|X| \le E(X^2)^{1/2}$ by Jensen's inequality. Thus, we can define the following nonnegative quantity (which we should be familiar with):

Definition 77
Let $X$ be a random variable with finite second moment. The variance of $X$ is

$$\mathrm{Var}(X) = E(X - E[X])^2 = E(X^2) - E(X)^2.$$


Theorem 78 (Cauchy-Schwarz) 


For any two random variables $X, Y \in L^2$, we have $E(|XY|) \le \|X\|_2 \|Y\|_2$.

Proof. If $\|X\|_2 = 0$ or $\|Y\|_2 = 0$, then either $X = 0$ or $Y = 0$ (almost everywhere), so $E(|XY|) = 0$ and the inequality holds. Otherwise, both $\|X\|_2$ and $\|Y\|_2$ are positive. First, consider the special case where $\|X\|_2 = \|Y\|_2 = 1$ and we assume that $E(|XY|) < \infty$. We know that

$$E(|X|^2) + E(|Y|^2) - 2E(|XY|) = E(|X| - |Y|)^2 \ge 0,$$

so $E(|XY|) \le 1$ and the inequality holds. In the more general case whenever $E(|XY|) < \infty$, we can scale $X$ and $Y$ to have norm 1 (by dividing by $\|X\|_2$ and $\|Y\|_2$ respectively), so that the special case implies

$$E\left|\frac{X}{\|X\|_2} \cdot \frac{Y}{\|Y\|_2}\right| \le 1 \implies E(|XY|) \le \|X\|_2 \|Y\|_2.$$

Finally, if we do not know that $E(|XY|)$ is finite to start with, we can use the previous case with capped random variables (so that we have finite expectation):

$$E(\min\{|X|, n\} \cdot \min\{|Y|, n\}) \le \|\min\{|X|, n\}\|_2 \, \|\min\{|Y|, n\}\|_2 \le \|X\|_2 \|Y\|_2.$$

We can now send $n \to \infty$ and use the monotone convergence theorem on the left-hand side to get the desired result.


This allows us to also define another familiar quantity that we use to study random variables: 


Definition 79 


Let $X, Y \in L^2$ be two random variables. The covariance of $X$ and $Y$ is given by

$$\mathrm{Cov}(X, Y) = E((X - E[X])(Y - E[Y])) = E(XY) - E[X]E[Y],$$

and $X, Y$ are uncorrelated if $\mathrm{Cov}(X, Y) = 0$.


In particular, any two random variables $X, Y \in L^2$ that are independent ($X \perp\!\!\!\perp Y$) are also uncorrelated, but the converse is false (random variables can be uncorrelated but not independent).
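To make the last point concrete, here is a small sketch of a standard counterexample (our own choice of illustration, not one given in lecture): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Exact rational arithmetic shows that the covariance vanishes while the product rule for independence fails.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2; each (x, y) pair has probability 1/3.
outcomes = [(-1, 1), (0, 0), (1, 1)]
prob = Fraction(1, 3)


def expect(f):
    """Exact expectation of f(X, Y) over the three equally likely outcomes."""
    return sum(prob * f(x, y) for x, y in outcomes)


# Cov(X, Y) = E[XY] - E[X]E[Y]
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Compare P(X = 0, Y = 0) with P(X = 0) P(Y = 0).
p_joint = sum(prob for x, y in outcomes if x == 0 and y == 0)
p_x0 = sum(prob for x, y in outcomes if x == 0)
p_y0 = sum(prob for x, y in outcomes if y == 0)

print("Cov(X, Y) =", cov)
print("P(X=0, Y=0) =", p_joint, "but P(X=0)P(Y=0) =", p_x0 * p_y0)
```

Here $Y$ is a deterministic function of $X$, so the two are as far from independent as possible, yet the symmetry of $X$ forces $E[XY] = E[X^3] = 0$.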

Now that we have all of our definitions set up, we're going to start to work towards proving the laws of large 
numbers. (As a joke, a “standard” example of such a law is that if someone goes to prison and they flip lots of coins 
in their free time, they should expect close to half of them to be heads.) We'll see that many of these laws will turn 
out to be applications of the Pythagorean theorem (like the proof of Cauchy-Schwarz secretly was). 

Today, we'll start with the simplest case. Suppose $X_1, \cdots, X_n$ are pairwise uncorrelated random variables, and consider their sample mean or empirical mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Because covariance is linear in each argument (because the Lebesgue integral is), we have

$$\mathrm{Var}\, \bar{X}_n = \frac{1}{n^2} \mathrm{Cov}\left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i\right).$$

Because the $X_i$s are uncorrelated, all cross-terms disappear and this simplifies to

$$\mathrm{Var}\, \bar{X}_n = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}(X_i).$$


Remark 80. Another way to think of this (in the “Pythagorean theorem” sense) is that we can think of $v_i = X_i - E(X_i)$ as vectors and note that all scalar products $\langle v_i, v_j \rangle$ for $i \ne j$ are zero by assumption. Thus, the variance is

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \left\|\frac{1}{n}\sum_{i=1}^n v_i\right\|_2^2 = \frac{1}{n^2}\sum_{i=1}^n \|v_i\|_2^2$$

by the Pythagorean theorem.


If we now assume that all $X_i$s have the same law (that is, they are all identically distributed) as some particular random variable $X \in L^2$, then

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \cdot n \mathrm{Var}(X) = \frac{\mathrm{Var}(X)}{n},$$

which gets smaller and smaller as $n$ gets larger, meaning that we expect the average to converge to the mean $E[X]$. To prove that, we need to make use of some inequalities we may have seen before (which basically bound the probability of having deviations far from the mean):


Theorem 81 (Markov’'s inequality) 


Let $X$ be a nonnegative random variable. Then for any $t > 0$, we have the inequality

$$P(X \ge t) = E[1\{X \ge t\}] \le E\left[\frac{X}{t} 1\{X \ge t\}\right] \le \frac{E[X]}{t}.$$


Corollary 82 (Chebyshev's inequality) 


Let $X \in L^2$ be a (not necessarily nonnegative) random variable. Then for any $t > 0$ (by Markov's inequality),

$$P(|X - E[X]| \ge t) = P((X - E[X])^2 \ge t^2) \le \frac{E[(X - E[X])^2]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}.$$


If we now apply Chebyshev's inequality to our expression for $\mathrm{Var}(\bar{X}_n)$, then we get the following result:

Theorem 83 ($L^2$ weak law of large numbers)
Suppose $X, X_i \in L^2$ are pairwise uncorrelated and identically distributed. Then the empirical mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to $E[X]$, in $L^2$ and in probability, as $n \to \infty$.

(Note that on probability spaces, convergence in measure becomes convergence in probability.) For completeness, we'll write out the proof in full:


Proof. From our above calculation, we have

$$\|\bar{X}_n - E[X]\|_2^2 = E\left(\frac{1}{n}\sum_{i=1}^n (X_i - E[X])\right)^2 = \frac{\mathrm{Var}(X)}{n}.$$

Because this goes to 0 as $n \to \infty$, we do indeed have $\bar{X}_n \to E[X]$ in $L^2$. For convergence in probability, note that for any $\varepsilon > 0$, we have

$$\lim_{n \to \infty} P(|\bar{X}_n - E[X]| \ge \varepsilon) \le \lim_{n \to \infty} \frac{\mathrm{Var}\, \bar{X}_n}{\varepsilon^2} = 0,$$

which is the condition we wish to prove.
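The Chebyshev step in this proof can be checked exactly in a simple special case. The sketch below assumes the $X_i$ are independent fair coin flips (our choice for illustration, so that $S_n \sim \mathrm{Bin}(n, 1/2)$ and $\mathrm{Var}(X) = 1/4$) and compares the exact deviation probability with the bound $\mathrm{Var}(X)/(n\varepsilon^2)$, using rational arithmetic to avoid rounding issues at the boundary.

```python
from fractions import Fraction
from math import comb


def exact_deviation_prob(n, eps):
    """Exact P(|S_n/n - 1/2| >= eps) for S_n ~ Bin(n, 1/2); eps is a Fraction."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - half) >= eps)
    return Fraction(count, 2 ** n)


def chebyshev_bound(n, eps):
    """Chebyshev bound Var(X) / (n eps^2) with Var(X) = 1/4 for a fair coin."""
    return Fraction(1, 4) / (n * eps ** 2)


eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(exact_deviation_prob(n, eps)), float(chebyshev_bound(n, eps)))
```

Both columns go to zero, but the exact probability decays much faster than the $1/n$ rate that Chebyshev guarantees; the bound is enough for the weak law, not a sharp estimate.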


This proves the simplest version of the law of large numbers, and the main idea is the geometric fact that if we are given a bunch of orthogonal vectors $v_1, v_2, \cdots$ of bounded norm, their average will have small norm. And next time, we'll see how to extend these types of results beyond $L^2$ geometry.

Last time, we started studying moments of random variables, leading us to define the variance $\mathrm{Var}\, X = E[(X - EX)^2] \in [0, \infty)$ for any random variable $X \in L^2$ (meaning that the $L^2$ norm $\|X\|_2 = E[X^2]^{1/2}$ is finite). We then proved the $L^2$ weak law of large numbers, which states that given pairwise uncorrelated, identically distributed random variables $X, X_i \in L^2$, we have $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converging to $E[X]$ in $L^2$ and in probability. We'll strengthen this result in a few ways today, by relaxing the $L^2$ assumption, showing almost-sure convergence under some conditions, and generalizing to triangular arrays. For the rest of class, consider an array of random variables of the following form:

$$\begin{array}{lll} X_{1,1} & & \\ X_{2,1} & X_{2,2} & \\ X_{3,1} & X_{3,2} & X_{3,3} \end{array}$$

Defining $S_n = \sum_{k=1}^n X_{n,k}$ to be the sum of the $n$th row of the array, our central question of study is whether $\frac{S_n}{b_n}$ converges to a constant for some deterministic $b_n$ (usually we'll have $b_n = n$, but we'll see some other examples as well). We'll first convert the proof from last time to this new notation:


Theorem 84 ($L^2$ weak law for triangular arrays)

Suppose we have a triangular array $X_{n,k}$ as above, and assume that $\mathrm{Var}(X_{n,k}) \le C < \infty$ for all $k, n$. Then $\frac{S_n - E[S_n]}{n}$ converges to 0, in $L^2$ and in probability, as $n \to \infty$.


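Proof. By the Pythagorean theorem, because the $X_{n,k}$s within each row are uncorrelated (as in Theorem 83), we can write

$$\left\|\frac{S_n - E[S_n]}{n}\right\|_2^2 = \frac{1}{n^2}\|S_n - E[S_n]\|_2^2 = \frac{1}{n^2}\sum_{k=1}^n \mathrm{Var}(X_{n,k}) \le \frac{C}{n},$$

which goes to 0 as $n \to \infty$, showing $L^2$ convergence. Convergence in probability again follows from Chebyshev (much like Theorem 83) because $\mathrm{Var}\left(\frac{S_n - E[S_n]}{n}\right) = \frac{1}{n^2}\mathrm{Var}(S_n) \le \frac{C}{n}$.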


Under the L? assumption, it was okay for us to just assume that all of the vectors were pairwise orthogonal, but 
we're going to have to work with something stronger going forward. So from now on, we'll assume independence of 


the random variables in each row. 

Theorem 85 (Weak law for triangular arrays)

Take the same triangular array as above, and assume we have a sequence $b_n \to \infty$ satisfying

$$\sum_{k=1}^n P(|X_{n,k}| > b_n) \to 0, \qquad \frac{1}{b_n^2}\sum_{k=1}^n E[X_{n,k}^2; |X_{n,k}| \le b_n] \to 0 \quad \text{as } n \to \infty.$$

Under these assumptions, define the random variables

$$Y_{n,k} = X_{n,k} 1\{|X_{n,k}| \le b_n\}, \qquad T_n = \sum_{k=1}^n Y_{n,k}.$$

Then $\frac{S_n - E[T_n]}{b_n}$ converges to 0 in probability.


This theorem is stated in an ugly way so that it is easier to prove: we are defining the random variables $T_n$ because the expectations $E[S_n]$ may not be defined in general.


Proof. First, we show that $S_n$ is generally close to $T_n$, and we can in fact show that $S_n$ and $T_n$ are equal with high probability. Indeed,

$$P(S_n \ne T_n) \le \sum_{k=1}^n P(X_{n,k} \ne Y_{n,k}) = \sum_{k=1}^n P(|X_{n,k}| > b_n)$$

because $T_n$ is the sum of truncated $X_{n,k}$s, while $S_n$ is just the sum of the original $X_{n,k}$s. This right-hand side goes to 0 by the first of our (convenient) assumptions, so $S_n - T_n \to 0$ in probability, and therefore $\frac{S_n - T_n}{b_n} \to 0$ in probability as well (since $b_n \to \infty$). Thus, it suffices to prove the statement for $T_n$ in place of $S_n$ (because $\frac{S_n - E[T_n]}{b_n} = \frac{T_n - E[T_n]}{b_n} + \frac{S_n - T_n}{b_n}$).

But we've designed $T_n$ so that it is nicely bounded: specifically, we have

$$\left\|\frac{T_n - E[T_n]}{b_n}\right\|_2^2 = \frac{1}{b_n^2}\sum_{k=1}^n \mathrm{Var}(Y_{n,k})$$

since the $Y_{n,k}$s are uncorrelated: they are functions of the $X_{n,k}$s, which are independent by assumption. (We do need the $X_{n,k}$s to be independent, because just knowing that the $X_{n,k}$s are uncorrelated wouldn't guarantee that the $Y_{n,k}$s are uncorrelated.) Then because $\mathrm{Var}(Y_{n,k}) \le E[Y_{n,k}^2]$, we can further bound

$$\left\|\frac{T_n - E[T_n]}{b_n}\right\|_2^2 \le \frac{1}{b_n^2}\sum_{k=1}^n E(Y_{n,k}^2),$$

which goes to zero by our second assumption. Thus we have convergence in $L^2$ for $\frac{T_n - E[T_n]}{b_n}$ and thus convergence in probability, proving the claim.


We can now use this “ugly” theorem to prove nicer results: the rest of our theorems today will involve iid variables, specifically those in a triangular array as shown below:

$$\begin{array}{lll} X_1 & & \\ X_1 & X_2 & \\ X_1 & X_2 & X_3 \end{array}$$

In particular, the entries within each column are now strongly dependent on each other (they are identical), but the entries within each row are independent if our $X_i$s are independent.

Theorem 86 (Weak law of large numbers for iid sequences)

Let $X, X_k$ be independent and identically distributed random variables, such that $g(x) = xP(|X| > x)$ goes to 0 as $x \to \infty$. Then as $n \to \infty$, $\frac{S_n}{n} - \mu_n$ converges to 0 in probability, where $\mu_n = E(X; |X| \le n)$.


Notice that this condition does not require us to assume that $E[X]$ is finite (which is equivalent to $E|X|$ being finite, because we need both the positive and negative integrals to be finite). In particular, $E|X| = \int_0^\infty P(|X| > x)\,dx$ (as we proved on our homework), and the condition that $g(x) \to 0$ only means that $P(|X| > x)$ decays faster than $\frac{1}{x}$ as $x \to \infty$. But there are functions that decay faster than $\frac{1}{x}$ that aren't integrable: for instance,

$$\int_2^b \frac{dx}{x \log x} = \log \log x \,\Big|_2^b$$

diverges as $b \to \infty$. So that's why we define $\mu_n$ in the way that we do: the assumptions on our random variables are slightly weaker than having a finite first moment.


Proof. We wish to apply Theorem 85 with $b_n = n$ (and having $X_k$ take the role of $X_{n,k}$). To do so, we just need to check the two conditions. For the first one, we can rewrite

$$\sum_{k=1}^n P(|X_{n,k}| > b_n) = nP(|X| > n) = g(n),$$

which goes to 0 by assumption as $n \to \infty$. For the second condition, define $Y_n = X \cdot 1\{|X| \le n\}$, so that we must prove that $n \cdot \frac{E(Y_n^2)}{n^2} = \frac{E(Y_n^2)}{n} \to 0$ as $n \to \infty$. But again using that $E[X] = \int_0^\infty P(X \ge t)\,dt$ for a nonnegative random variable $X$,

$$\frac{1}{n}E(Y_n^2) = \frac{1}{n}\int_0^\infty P(Y_n^2 \ge t)\,dt = \frac{1}{n}\int_0^\infty P(|Y_n| \ge \sqrt{t})\,dt.$$

Making the change of variables $y = \sqrt{t}$, $dt = 2y\,dy$, our integral becomes

$$\frac{1}{n}E(Y_n^2) = \frac{1}{n}\int_0^\infty P(|Y_n| \ge y) \cdot 2y\,dy.$$

We can then cap the upper limit of the integral at $n$, because $|Y_n|$ is always at most $b_n = n$ and thus $P(|Y_n| \ge y) = 0$ above that point, leading us to

$$\frac{1}{n}E(Y_n^2) = \frac{1}{n}\int_0^n P(|Y_n| \ge y)\,2y\,dy = \frac{1}{n}\int_0^n P(n \ge |X| \ge y)\,2y\,dy \le \frac{2}{n}\int_0^n g(y)\,dy$$

(in the last step, $P(|X| \ge y)$ and $P(|X| > y)$ differ for at most countably many $y$, which does not affect the integral). But since this last expression is twice the average value of $g$ on $[0, n]$, and $g$ decays to zero (and has a finite supremum), the average will also go to zero as $n \to \infty$. Thus both conditions of Theorem 85 hold, and $\mu_n$ is the value of $\frac{E[T_n]}{n}$ in that theorem's language, implying the result.


Corollary 87 


Suppose $X, X_k$ are iid random variables with finite mean $E[X] = \mu$. Then $\frac{S_n}{n}$ converges to $\mu$ in probability.


Proof. Similarly to Markov's inequality, we can write

$$g(x) = xP(|X| \ge x) = E[x; |X| \ge x] \le E[|X|; |X| \ge x].$$

Now notice that the expression $|X| \cdot 1\{|X| \ge x\}$ goes to 0 as $x \to \infty$ almost surely, and all of the $|X| \cdot 1\{|X| \ge x\}$ are bounded from above by $|X|$, which is integrable by assumption. So because the expectations $E[|X|; |X| \ge x]$ are Lebesgue integrals, we can apply the dominated convergence theorem to find that $g(x) \to 0$ as $x \to \infty$ (specifically, we can apply it to the functions $|X| \cdot 1\{|X| \ge x\}$ for integer $x$ and notice that these functions are nonincreasing in $x$ for all real $x$). Therefore, $\frac{S_n}{n} - \mu_n$ goes to 0 in probability by Theorem 86. It remains to check that $\mu_n \to \mu$, but we have

$$|\mu - \mu_n| = |E[X] - E[X; |X| \le n]| \le E[|X|; |X| > n],$$

which, just like before, goes to zero. Thus $\frac{S_n}{n} - \mu$ converges to 0 in probability, as desired.


We'll now show a stronger form of convergence under these same conditions: 


Theorem 88 (Strong law of large numbers) 


Suppose $X, X_k$ are iid random variables with finite mean $E[X] = \mu$. Then $\frac{S_n}{n}$ converges to $\mu$ almost surely.


Proof. We may assume without loss of generality that $X \ge 0$, because we can write $X = X_+ - X_-$ and prove separately that the average of the $(X_k)_+$s (resp. $(X_k)_-$s) converges to $E[X_+]$ (resp. $E[X_-]$) almost surely. For this proof, we need to use a different truncation than before, defining

$$Y_k = X_k 1\{X_k \le k\}.$$

In the previous proof, we truncated each row of our triangular array at the same level $n$, so the sum of a given row would be $\sum_{k=1}^n X_k 1\{|X_k| \le n\}$. But this time, we instead have

$$T_n = \sum_{k=1}^n X_k 1\{X_k \le k\}.$$

The first step of our proof is again to show that $S_n$ and $T_n$ are close. Notice that

$$\sum_{k=1}^\infty P(X_k \ne Y_k) \le \sum_{k=1}^\infty P(X_k \ge k) = \sum_{k=1}^\infty P(X \ge k)$$

because the $X_k$s are all identically distributed as $X$, and now we can convert the right-hand side from a Riemann sum to an integral (with overall error at most $P(X \ge 0) = 1$) to get

$$\sum_{k=1}^\infty P(X_k \ne Y_k) \le 1 + \int_0^\infty P(X \ge t)\,dt = 1 + E[X] < \infty.$$

By the Borel-Cantelli lemma (homework), this means that $P(X_k \ne Y_k \text{ i.o.})$ (infinitely often) must be zero, where the event is defined as

$$\{X_k \ne Y_k \text{ i.o.}\} = \{\omega \in \Omega : X_k(\omega) \ne Y_k(\omega) \text{ for infinitely many } k\}.$$

This implies the first boxed statement, $\frac{S_n - T_n}{n} \to 0$ almost surely, because with probability one, there are only finitely many indices where $X_k - Y_k$ will be nonzero, so $S_n - T_n$ converges to some finite value $C$. (Thus $\frac{S_n - T_n}{n}$ is eventually bounded in absolute value by $\frac{|C| + 1}{n}$, which goes to zero.)


Next, we can also show the second boxed statement, $\frac{E[T_n]}{n} \to \mu$ (a deterministic limit), because

$$\mu - \frac{E[T_n]}{n} = \frac{E[S_n - T_n]}{n} = \frac{1}{n}\sum_{k=1}^n E[X; X > k].$$

As we've shown before, $E[X; X > k] \to 0$ as $k \to \infty$ by the dominated convergence theorem, so the right-hand side (which is an average of these types of terms) also converges to 0, as desired. In other words, the two boxed statements above allow us to replace $S_n$ with $T_n$, and $\mu$ with $\frac{E[T_n]}{n}$, so it now suffices to show that $\frac{T_n - E[T_n]}{n} \to 0$ almost surely. To do so, we'll start by calculating the $L^2$ norm as before. We have

$$\left\|\frac{T_n - E[T_n]}{n}\right\|_2^2 = \frac{1}{n^2}\sum_{k=1}^n \mathrm{Var}(Y_k) \le \frac{1}{n^2}\sum_{k=1}^n E[Y_k^2].$$


However, this expression is more difficult to work with than the ones in previous proofs, because our $X_k$s aren't even assumed to have finite variance. Suppose we knew that $\sum_{k=1}^\infty \frac{E[Y_k^2]}{k^2}$ is finite and equal to some value $A$. Then we could bound the $L^2$ norm by breaking up the sum into parts:

$$\frac{1}{n^2}\sum_{k=1}^n E[Y_k^2] = \sum_{k=1}^n \left(\frac{k}{n}\right)^2 \frac{E[Y_k^2]}{k^2} \le \sum_{k \le \log n} \left(\frac{\log n}{n}\right)^2 \frac{E[Y_k^2]}{k^2} + \sum_{k > \log n} \frac{E[Y_k^2]}{k^2} \le \left(\frac{\log n}{n}\right)^2 A + \sum_{k > \log n} \frac{E[Y_k^2]}{k^2}.$$

The first term would then go to 0 as $n \to \infty$, and so would the second because we'd be taking the tail of a convergent sum.


So we will calculate the sum $\sum_{k=1}^\infty \frac{E[Y_k^2]}{k^2}$ above, converting to an integral just like in Theorem 86, to get

$$\sum_{k=1}^\infty \frac{1}{k^2}E[Y_k^2] \le \sum_{k=1}^\infty \frac{1}{k^2}\int_0^k 2yP(X \ge y)\,dy.$$

Because the integrand is nonnegative, we can swap the sum and the integral to get

$$\sum_{k=1}^\infty \frac{1}{k^2}E[Y_k^2] \le \int_0^\infty P(X \ge y)\left[2y\sum_{k \ge y} \frac{1}{k^2}\right]dy.$$

Now the inner bracketed term is uniformly bounded over all $y \ge 0$ by some finite constant $c$, because it is bounded by $2y$ times a constant for small $y$ and the inner sum is proportional to $\frac{1}{y}$ for large $y$. Thus, we can write

$$\sum_{k=1}^\infty \frac{1}{k^2}E[Y_k^2] \le c\int_0^\infty P(X \ge y)\,dy = cE[X] < \infty,$$

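as desired. Thus the $L^2$ norm of $\frac{T_n - E[T_n]}{n}$ goes to zero and we have convergence in probability as usual, but that's not what we wanted to prove. Instead, notice that this calculation gives us a useful rate of decay. For any subsequence of the integers $n(\ell)$ (going to infinity), we have (by the same logic but now summing over the subsequence)

$$\sum_{\ell \ge 1}\left\|\frac{T_{n(\ell)} - E[T_{n(\ell)}]}{n(\ell)}\right\|_2^2 \le \sum_{\ell \ge 1}\frac{1}{n(\ell)^2}\sum_{k=1}^{n(\ell)} E[Y_k^2] = \sum_{k=1}^\infty E[Y_k^2]\sum_{\ell \ge 1}\frac{1\{n(\ell) \ge k\}}{n(\ell)^2}.$$

So if we can pick a sequence $n(\ell)$ such that $\sum_{\ell \ge 1}\frac{1\{n(\ell) \ge k\}}{n(\ell)^2} \le \frac{C}{k^2}$, we will be in the case above (where the sum $\sum_k \frac{E[Y_k^2]}{k^2}$ is finite). Taking $n(\ell) = \alpha^\ell$ for some $\alpha > 1$, we do indeed find by a geometric series bound that $\sum_{\ell \ge 1}\frac{1\{\alpha^\ell \ge k\}}{\alpha^{2\ell}} \le \frac{C(\alpha)}{k^2}$ for some constant $C(\alpha)$ and for all $k \ge 1$. So the argument above with the $L^2$ norm tells us that

$$\sum_{\ell \ge 1}\left\|\frac{T_{\alpha^\ell} - E[T_{\alpha^\ell}]}{\alpha^\ell}\right\|_2^2 < \infty.$$

Therefore, for any $\varepsilon > 0$, $\left|\frac{T_{\alpha^\ell} - E[T_{\alpha^\ell}]}{\alpha^\ell}\right| > \varepsilon$ only occurs finitely many times almost surely (because the sum of the probabilities of these events is finite by Chebyshev's inequality, and then we can apply Borel-Cantelli). So $\frac{T_{\alpha^\ell} - E[T_{\alpha^\ell}]}{\alpha^\ell} \to 0$ as $\ell \to \infty$ almost surely, and we've shown almost-sure convergence along a subsequence. Finally, for the full sequence, we can use that the $X_k$s are nonnegative. In particular, for any $\alpha^\ell \le n \le \alpha^{\ell+1}$, because $T_{\alpha^\ell} \le T_n \le T_{\alpha^{\ell+1}}$ and $\alpha^{\ell+1} \ge n \ge \alpha^\ell$, we also have

$$\frac{T_{\alpha^\ell}}{\alpha^{\ell+1}} \le \frac{T_n}{n} \le \frac{T_{\alpha^{\ell+1}}}{\alpha^\ell}.$$

Taking $\ell \to \infty$ and using almost-sure convergence along a subsequence (together with $\frac{E[T_n]}{n} \to \mu$), we get

$$\frac{\mu}{\alpha} \le \liminf_{n \to \infty} \frac{T_n}{n} \le \limsup_{n \to \infty} \frac{T_n}{n} \le \alpha\mu.$$

Finally, taking $\alpha \downarrow 1$ shows that the lim inf and lim sup of $\frac{T_n}{n}$ are both $\mu$, so $\frac{T_n}{n}$ converges to $\mu$ almost surely. Combining this with the fact that $\frac{S_n - T_n}{n} \to 0$ almost surely (from above) finishes the proof.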
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Class today started with an attendance quiz “not intended to be difficult:” 


Problem 89 
Let $X$ be a random variable satisfying

$$P(X \le x) = \begin{cases} 1 - \frac{1}{x} & x \ge 1, \\ 0 & x < 1. \end{cases}$$

Calculate $E[X^p; X \le n]$ without explicitly finding the probability density of $X$.

Solution. We're given the tail bound $P(X \ge x) = \frac{1}{x}$ for all $x \ge 1$, and we wish to compute

$$E[X^p; X \le n] = E[Y^p], \quad \text{where } Y = X \cdot 1\{X \le n\}.$$

To do this, we can use the useful formula (like we used in last lecture)

$$E[Y^p] = \int_0^\infty P(Y^p \ge t)\,dt.$$

If we now change variables by setting $t = y^p$, $dt = py^{p-1}\,dy$, this simplifies to

$$E[Y^p] = \int_0^\infty py^{p-1}P(Y \ge y)\,dy = \int_0^n py^{p-1}P(Y \ge y)\,dy = \int_0^n py^{p-1}P(y \le X \le n)\,dy,$$

at which point we can plug in $P(y \le X \le n) = \frac{1}{y} - \frac{1}{n}$ (for $1 \le y \le n$) and directly integrate to get the answer.

(The main takeaway is that this strategy gives us a way to bound $E[X^p; X \le n]$ without needing an explicit density function, as long as we know something like $P(X > x) \le \frac{C}{x}$.)
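The tail-integral formula from this solution is easy to check numerically. The sketch below assumes the tail $P(X \ge x) = \frac{1}{x}$ for $x \ge 1$ from Problem 89, approximates $\int_0^n p y^{p-1} P(y \le X \le n)\,dy$ with a midpoint rule, and compares against the value $\frac{n^{p-1} - 1}{p - 1}$ that we get by carrying out the integral by hand (for $p \ne 1$). The function names and the number of grid steps are our own choices.

```python
def truncated_moment(p, n, steps=100_000):
    """Midpoint-rule approximation of int_0^n p y^(p-1) P(y <= X <= n) dy."""

    def prob_between(y):
        # X >= 1 always, so for y < 1 the event {y <= X <= n} equals {X <= n}.
        return 1.0 / max(y, 1.0) - 1.0 / n

    h = n / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += p * y ** (p - 1) * prob_between(y) * h
    return total


def closed_form(p, n):
    """Hand-computed value of E[X^p; X <= n] for p != 1."""
    return (n ** (p - 1) - 1) / (p - 1)


for p, n in ((2, 10), (3, 5)):
    print(p, n, truncated_moment(p, n), closed_form(p, n))
```

For $p = 2$ the truncated second moment grows linearly in $n$, which is the kind of estimate needed when checking the second condition of Theorem 85.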

Last lecture, we finished by proving the strong law of large numbers, which states that if $X$ and $X_i$ are iid and $E[X] = \mu$ is finite, then $\frac{1}{n}\sum_{i=1}^n X_i$ converges to $\mu$ almost surely. Today, we'll work “around that value” and find bounds on the probability that $\frac{S_n}{n}$ is approximately $\mu + \varepsilon$ for some constant $\varepsilon$. We'll start with a special case:


Example 90 


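Suppose $X \sim N(0,1)$ is standard Gaussian with density $\varphi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$.

In this case, $S_n = \sum_{i=1}^n X_i \sim N(0, n)$ is also Gaussian with variance $n$, meaning that $\frac{S_n}{\sqrt{n}} \sim N(0,1)$, $S_n$ has probability density $\varphi\left(\frac{x}{\sqrt{n}}\right)\frac{1}{\sqrt{n}}$, and the strong law of large numbers tells us that $\frac{S_n}{n} \to 0$ almost surely as $n \to \infty$. But we might be interested in behavior away from the mean as well. For example, we may want to calculate

$$P\left(\frac{S_n}{n} \ge \varepsilon\right) = P(N(0,1) \ge \varepsilon\sqrt{n}) = \int_{\varepsilon\sqrt{n}}^\infty \varphi(z)\,dz.$$

Since $\varphi$ decays very fast, this is approximately on the order of $\varphi(\varepsilon\sqrt{n}) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{n\varepsilon^2}{2}\right)$. And we can show more precisely that the integral is in fact $\exp\left(-\frac{n\varepsilon^2}{2} + o(n)\right)$, so the corrections are not leading order, and differing from the mean by $\varepsilon$ is exponentially unlikely in $n$.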
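To see the exponential rate $\frac{\varepsilon^2}{2}$ emerge numerically, the following sketch evaluates $-\frac{1}{n}\log P(N(0,1) \ge \varepsilon\sqrt{n})$ using the standard identity $P(Z \ge t) = \frac{1}{2}\operatorname{erfc}(t/\sqrt{2})$. The value $\varepsilon = 0.5$ and the sample sizes are arbitrary choices, kept small enough that the tail probability does not underflow in floating point.

```python
import math


def gaussian_tail_rate(eps, n):
    """Return -(1/n) * log P(N(0,1) >= eps * sqrt(n))."""
    t = eps * math.sqrt(n)
    tail = 0.5 * math.erfc(t / math.sqrt(2))  # P(Z >= t) for standard Gaussian Z
    return -math.log(tail) / n


eps = 0.5
for n in (100, 1000, 4000):
    print(n, gaussian_tail_rate(eps, n), "target:", eps ** 2 / 2)
```

The computed rates decrease toward $\frac{\varepsilon^2}{2} = 0.125$; the gap is the $o(n)$ correction divided by $n$, which shrinks roughly like $\frac{\log n}{n}$.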


Example 91 
Suppose $X \sim \mathrm{Ber}(p)$ is a Bernoulli random variable, meaning that $X$ is 1 with probability $p$ and 0 with probability $(1 - p)$. Then $S_n$ is Binomial $\sim \mathrm{Bin}(n, p)$ with probability mass function $P(S_n = k) = \binom{n}{k}p^k(1 - p)^{n-k}$.

Again by the strong law of large numbers, $\frac{S_n}{n}$ approaches $p$ almost surely, and the central limit theorem further tells us that

$$\frac{S_n - np}{\sqrt{np(1 - p)}} \approx N(0,1).$$


We haven't proved the central limit theorem yet, but we can do a heuristic calculation to explain why we do get an approximately normal distribution. Applying Stirling's approximation $n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$, we find that for any $0 < x < 1$ where $nx$ is an integer,

$$P(S_n = nx) = \binom{n}{nx}p^{nx}(1 - p)^{n(1-x)} \sim \frac{\sqrt{2\pi n}\,(n/e)^n}{\sqrt{2\pi nx}\sqrt{2\pi n(1 - x)}\,(nx/e)^{nx}(n(1 - x)/e)^{n(1-x)}}\,p^{nx}(1 - p)^{n(1-x)}.$$

(This approximation is good when $x$ is not too close to 0 or 1, because then the factorials in the binomial coefficient are all large.) The $(n/e)^n$ factors cancel out in the top and bottom, leaving us with

$$P(S_n = nx) \sim \frac{1}{\sqrt{2\pi nx(1 - x)}}\left(\frac{p}{x}\right)^{nx}\left(\frac{1 - p}{1 - x}\right)^{n(1-x)},$$

which can be rewritten in terms of an exponential as

$$P(S_n = nx) \sim \frac{1}{\sqrt{2\pi nx(1 - x)}}\exp\left(-n\left\{x\log\frac{x}{p} + (1 - x)\log\frac{1 - x}{1 - p}\right\}\right).$$


This inner bracketed term is often denoted $I_p(x)$ or $H(x|p)$, and it is also called the binary relative entropy. Remembering that our goal is to make this look like a Gaussian, and specifically that $z = \frac{S_n - np}{\sqrt{np(1 - p)}} = \frac{n(x - p)}{\sqrt{np(1 - p)}}$ should be approximately standard Gaussian, we're motivated to make a change of variables from $x$ to $z$. Implicitly, we need $z$ to be bounded ($O(1)$) for the next calculation to be valid. Specifically, assuming that $x = p + O\left(\frac{1}{\sqrt{n}}\right)$ (so $S_n$ is pretty close to the mean), we can plug in $x = p + \frac{1}{n}\sqrt{np(1 - p)}\,z$ to get

$$P\left(S_n = np + \sqrt{np(1 - p)}\,z\right) \sim \frac{1}{\sqrt{2\pi np(1 - p)}}\exp\left(-nI_p\left(p + \frac{\sqrt{np(1 - p)}\,z}{n}\right)\right),$$

where we have replaced the $x$s in the denominator of the prefactor with $p$ (which is allowed because $x$ is close to $p$ by assumption). Additionally, because $I_p$ is being evaluated close to its minimizer $p$, we can Taylor expand around that minimum. The first two terms of the Taylor series are zero because $I_p(p) = I_p'(p) = 0$, and thus (after some calculus, using $I_p''(p) = \frac{1}{p(1 - p)}$) we have $I_p(x) = \frac{z^2}{2n} + O\left(\frac{1}{n^{3/2}}\right)$. Plugging everything back in, we find that

$$P\left(S_n = np + \sqrt{np(1 - p)}\,z\right) \sim \frac{1}{\sqrt{2\pi np(1 - p)}}\exp\left(-\frac{z^2}{2} + O\left(\frac{1}{\sqrt{n}}\right)\right),$$

at which point the error term here can be absorbed into the constant of proportionality. So we've recovered the Gaussian density. Specifically, our random variable $\frac{S_n - np}{\sqrt{np(1 - p)}}$ converges to the Gaussian in the sense that for any fixed $a < b$,

$$P\left(a \le \frac{S_n - np}{\sqrt{np(1 - p)}} \le b\right) \approx \sum_z \frac{1}{\sqrt{2\pi np(1 - p)}}\exp\left(-\frac{z^2}{2}\right),$$

where $z$ sums over all points in the lattice $\frac{\mathbb{Z} - np}{\sqrt{np(1 - p)}}$ within $[a, b]$ to guarantee that we only sum over integer values of $S_n$ (since we still actually have a discrete binomial distribution). So this hopefully gives a good sense of how the central limit theorem statement will look in general.
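The heuristic can be tested directly by comparing the exact binomial mass function with the Gaussian expression $\frac{1}{\sqrt{2\pi np(1-p)}}\exp(-z^2/2)$. The sketch below works in log space (via `math.lgamma`) to avoid overflow in the binomial coefficient; the parameters $n = 1000$, $p = 0.3$ and the helper names are illustrative choices.

```python
import math


def log_binom_pmf(n, k, p):
    """log P(S_n = k) for S_n ~ Bin(n, p), computed with log-gamma functions."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)


def gaussian_approx(n, k, p):
    """Local Gaussian approximation exp(-z^2/2) / sqrt(2 pi n p (1-p))."""
    var = n * p * (1 - p)
    z = (k - n * p) / math.sqrt(var)
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi * var)


n, p = 1000, 0.3
for k in (300, 315, 330):
    exact = math.exp(log_binom_pmf(n, k, p))
    print(k, exact, gaussian_approx(n, k, p))

# Ratio at the mean k = np, where the approximation should be best.
ratio = math.exp(log_binom_pmf(n, 300, p)) / gaussian_approx(n, 300, p)
```

Near the mean the two values agree closely, and the agreement degrades as $k$ moves several standard deviations away, in line with the requirement that $z$ stay $O(1)$.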


Remark 92. It's important to note that the calculation above does not actually tell us how to estimate $P(S_n - np \ge n\varepsilon)$ for some constant $\varepsilon$. We made the assumption in our argument that $S_n - np$ is on the order of $\sqrt{n}$. Because of the extra factor of $\sqrt{n}$ in the error term of our Taylor expansion, we can go a little farther out, but we can check that we cannot get all the way to $S_n - np$ being linear in $n$ and still expect Gaussian behavior to hold.


To actually understand the tail behavior of a binomial random variable, we have to go back to

$$P(S_n - np \ge n\varepsilon) = \sum_{k \ge n(p + \varepsilon)}\binom{n}{k}p^k(1 - p)^{n-k}.$$

We can approximate this by Stirling's formula: because we're now summing over $k$ larger than the mean, the sum will be dominated by the smallest values of $k$. Skipping the calculations, we find that

$$P(S_n - np \ge n\varepsilon) = \exp\left(-nI_p(p + \varepsilon) + o(n)\right).$$

(Here $o(n)$ is not the best possible approximation, but it's good enough for us because we only care about the leading term.) And this is not the same as the naive guess we might have (of just plugging into the Gaussian density) because this time we can't Taylor expand $I_p$ around $p$. In other words, we find that even if we have the central limit theorem result above, we still have

$$P\left(\frac{S_n - np}{\sqrt{np(1 - p)}} \ge \frac{n\varepsilon}{\sqrt{np(1 - p)}}\right) \not\approx \exp\left(-\frac{n\varepsilon^2}{2p(1 - p)}\right).$$
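The mismatch between the true exponent $I_p(p + \varepsilon)$ and the naive Gaussian exponent $\frac{\varepsilon^2}{2p(1-p)}$ can be checked with exact integer arithmetic. The sketch below takes $p$ and $\varepsilon$ as rational numbers so that the binomial tail is an exact ratio of integers; the specific values $p = \frac{1}{5}$, $\varepsilon = \frac{3}{10}$, $n = 2000$ are arbitrary illustrative choices.

```python
import math
from fractions import Fraction


def exact_tail_rate(n, p, eps):
    """-(1/n) log P(S_n - np >= n*eps) for S_n ~ Bin(n, p); p, eps are Fractions."""
    a, b = p.numerator, p.denominator  # p = a / b
    k_min = math.ceil(n * (p + eps))   # smallest k with k >= n (p + eps)
    # P = numerator / b^n, where the numerator is an exact (large) integer.
    numerator = sum(math.comb(n, k) * a ** k * (b - a) ** (n - k)
                    for k in range(k_min, n + 1))
    # math.log accepts arbitrarily large Python integers.
    return -(math.log(numerator) - n * math.log(b)) / n


def relative_entropy(x, p):
    """Binary relative entropy I_p(x) = x log(x/p) + (1-x) log((1-x)/(1-p))."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))


p, eps, n = Fraction(1, 5), Fraction(3, 10), 2000
rate = exact_tail_rate(n, p, eps)
entropy_rate = relative_entropy(float(p + eps), float(p))
gaussian_rate = float(eps) ** 2 / (2 * float(p) * (1 - float(p)))
print("exact:", rate, "I_p(p+eps):", entropy_rate, "Gaussian guess:", gaussian_rate)
```

For these values the exact rate sits close to $I_p(p + \varepsilon) \approx 0.223$ and visibly away from the Gaussian guess $\approx 0.281$, so plugging a linear-in-$n$ deviation into the central limit theorem gives the wrong exponent.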
Our main goal for next lecture will thus be to find conditions for when iid random variables $X, X_i$ with finite mean $\mu$ satisfy a relation of the form

$$P(S_n - n\mu \ge n\varepsilon) = \exp\left(-nf(\varepsilon, \mu) + o(n)\right)$$

for some function $f$. This is a subject called large deviations theory, which is pretty deep, but we'll only study it briefly in this class. We'll finish today with a final example which has some important large deviations concepts:


Example 93 
Consider a $k \times n$ table of bins, where we fix the number of rows $k$ and study the limit as $n \to \infty$. Place $nkp$ balls


into the bins of the table uniformly at random (where each bin can hold at most one ball). We wish to estimate 


the probability of the event F = {every column has at least one ball}. 


First of all, we must have $nkp \ge n$ (so that there are enough balls to have one per column). For simplicity, let's assume this inequality is strict so that we have $kp > 1$.

We'll start with the simple case $k = 2$, in which the probability can be calculated exactly. In this case, we have a
$2 \times n$ table with $2np$ filled bins. If all columns have at least one ball (that is, if $F$ occurs), then $nx$ of them have a
single ball and the remaining $n(1-x)$ have two balls, where $nx + 2n(1-x) = 2np \implies x = 2(1-p)$. The probability
of this occurring if $2np$ of our $2n$ bins are randomly chosen to be filled is then
$$P(F) = \binom{n}{nx}\, 2^{nx} \Big/ \binom{2n}{2np},$$
because we pick which $nx$ columns only get one ball and whether the top or bottom bin is filled. (And if we want
a nicer expression, we can always use Stirling's approximation to get an asymptotic bound.) But even moving up to
$k = 3$ already complicates the problem: we could separate our columns into those with one, two, and three occupied
entries, but now the number of each is no longer uniquely determined just by the total number of balls (for example,
we could fill two adjacent columns with one and three balls, or with two and two). So combinatorial calculations won't
be able to get us an immediate answer.
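
As a sanity check on the $k = 2$ count, here is a minimal Python sketch that compares the closed form $\binom{n}{nx} 2^{nx} / \binom{2n}{2np}$ against brute-force enumeration for small tables. The function names and the convention that bin `b` sits in column `b % n` are assumptions made for this illustration only.

```python
from itertools import combinations
from math import comb


def prob_all_columns_hit_formula(n: int, balls: int) -> float:
    """Closed form for k = 2: C(n, nx) * 2^(nx) / C(2n, balls), where nx = 2n - balls."""
    singles = 2 * n - balls  # nx: number of columns holding exactly one ball
    if singles < 0 or singles > n:
        return 0.0
    return comb(n, singles) * 2 ** singles / comb(2 * n, balls)


def prob_all_columns_hit_bruteforce(n: int, balls: int) -> float:
    """Enumerate every way to fill `balls` of the 2n bins and count the covering ones."""
    good = total = 0
    for filled in combinations(range(2 * n), balls):
        total += 1
        # Bins b and b + n share a column, so the column index is b % n.
        if len({b % n for b in filled}) == n:
            good += 1
    return good / total
```

For example, with $n = 4$ columns and $6$ balls both functions give $24/28$: the only bad placements are the four in which the two empty bins share a column.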

Instead, we'll turn to probability to help us solve this problem. Let $\theta \in (0, 1)$ be a parameter (which is arbitrary for
now), and let $X_1, \dots, X_n$ be iid $\mathrm{Bin}(k, \theta)$ random variables. We claim that regardless of the value of $\theta$, we have (here
$P_\theta$ denotes the probability measure coming from the $X_i$s)
$$P(F) = P_\theta\left(X_i \ge 1 \text{ for all } 1 \le i \le n \,\middle|\, \sum_{i=1}^n X_i = nkp\right).$$
To understand this, the idea is that $X_i$ represents the number of balls in the $i$th column if we fill every bin independently
with probability $\theta$. Then if we condition on the total number of balls $\sum_{i=1}^n X_i$ being $nkp$, we're indeed placing $nkp$ balls
into the bins uniformly at random, and we want each column to have at least one ball. For notational convenience,
define $G = \{X_i \ge 1 \text{ for all } 1 \le i \le n\}$ and $S = \{\sum_{i=1}^n X_i = nkp\}$. By Bayes' rule, we can rewrite the probability that
we want as
$$P(F) = P_\theta(G \mid S) = \frac{P_\theta(G \cap S)}{P_\theta(S)} = \frac{P_\theta(G)\,P_\theta(S \mid G)}{P(\mathrm{Bin}(nk, \theta) = nkp)}.$$
All of the terms on the right-hand side can now be dealt with relatively easily. The denominator can be calculated
using the work in Example 91, and $P_\theta(G)$ is much easier to calculate than $P_\theta(G \mid S)$ because the events in the various
columns are now independent (meaning that it is just $P(\mathrm{Bin}(k, \theta) \ge 1)^n$). Finally, the conditional probability $P_\theta(S \mid G)$
can just be bounded from above by 1. This leaves us with
$$P_\theta(G \mid S) \le \frac{P(\mathrm{Bin}(k, \theta) \ge 1)^n}{P(\mathrm{Bin}(nk, \theta) = nkp)},$$


and now we can get the best possible bound by optimizing the right-hand side over $\theta$. We may then want to know
whether the optimal $\theta$ actually gives us a good bound on $P(F)$. To see that, we can take a closer look at the term we
bounded by 1 and write it out as
$$P_\theta(S \mid G) = P_\theta\left(\sum_{i=1}^n X_i = nkp \,\middle|\, X_i \ge 1 \text{ for all } 1 \le i \le n\right) = P\left(\sum_{i=1}^n Y_i = nkp\right),$$
where $Y_i$ is distributed as $X_i$ conditioned on $\{X_i \ge 1\}$. For our bound to be good, we should try to make this probability
large, so it makes sense to choose $\theta$ so that $E_\theta Y = kp$. Explicitly, this means that we want to pick $\theta$ such that
$$E_\theta(X \mid X \ge 1) = \frac{k\theta}{1 - (1 - \theta)^k} = kp.$$
Picking this value of $\theta$, we know that $\frac{1}{n}\sum_{i=1}^n Y_i$ converges almost surely to $kp$ as $n \to \infty$, and with a bit more work
(specifically the local central limit theorem), it can be shown that $P(\sum_{i=1}^n Y_i = nkp) = \exp(-o(n))$, meaning that our
estimate of $P(F)$ is fairly good.
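
The choice of $\theta$ and the resulting bound can be explored numerically. The following Python sketch (all function names are hypothetical, and the exact value of $P(F)$ is computed by inclusion-exclusion over empty columns, a technique not used in the lecture) solves $k\theta / (1 - (1-\theta)^k) = kp$ by bisection and evaluates the upper bound $P(\mathrm{Bin}(k,\theta) \ge 1)^n / P(\mathrm{Bin}(nk,\theta) = nkp)$.

```python
from math import comb


def truncated_mean(theta: float, k: int) -> float:
    """E[X | X >= 1] for X ~ Bin(k, theta)."""
    return k * theta / (1.0 - (1.0 - theta) ** k)


def solve_theta(k: int, p: float, tol: float = 1e-12) -> float:
    """Bisection for the theta in (0, 1) with E[X | X >= 1] = k * p (assumes 1 < k * p < k)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if truncated_mean(mid, k) < k * p:
            lo = mid  # conditional mean too small, so theta must increase
        else:
            hi = mid
    return (lo + hi) / 2


def exact_prob(n: int, k: int, balls: int) -> float:
    """P(F) by inclusion-exclusion over the set of columns forced to be empty."""
    covering = sum((-1) ** j * comb(n, j) * comb(k * (n - j), balls) for j in range(n + 1))
    return covering / comb(k * n, balls)


def upper_bound(n: int, k: int, balls: int, theta: float) -> float:
    """P(Bin(k, theta) >= 1)^n / P(Bin(nk, theta) = balls)."""
    numer = (1.0 - (1.0 - theta) ** k) ** n
    denom = comb(n * k, balls) * theta ** balls * (1.0 - theta) ** (n * k - balls)
    return numer / denom
```

For $k = 2$ the equation reduces to $2/(2 - \theta) = 2p$, so `solve_theta(2, 0.75)` should return approximately $2/3$; comparing `upper_bound` with `exact_prob` for several values of $\theta$ shows how the quality of the bound depends on the tilt.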

In summary, this example shows that a specific counting configuration problem can be made simpler with probability. 


Next lecture, we'll look at the more general question we've proposed about large deviations!

(Another homework assignment has been posted, and it will be due in a week.) Last lecture, we studied sums of
independent random variables. In particular, we've shown that if $X, X_i$ are iid with finite mean $\mu$, then their average
approaches $\mu$, but we're now curious about the probability that this average is more than $\mu + \varepsilon$ for some constant
$\varepsilon > 0$. Last time, we did a calculation to verify that when $X$ is standard normal, we have
$$P(S_n > n\varepsilon) = \exp\left(-\frac{n\varepsilon^2}{2} + o(n)\right),$$
and similarly when $X$ is Bernoulli with probability $p$, we have
$$P(S_n - np > n\varepsilon) = \exp\left(-n I_p(p + \varepsilon) + o(n)\right).$$
Today, we'll see that these formulas are part of a more general structure.


Definition 94 


The moment generating function (also mgf) of a random variable $X$ is the function $m(\theta) = E[e^{\theta X}]$.

Because $e^{\theta X} > 0$ for all $\theta$, we have $m(\theta) \in (0, \infty]$ for any $\theta$. Additionally, we always have $m(0) = 1$, but it's
possible that this is the only finite value for the moment generating function.


Lemma 95 


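For any random variable $X$, let $m(\theta) = E[e^{\theta X}]$, and define
$$\theta_+ = \sup\{\theta : m(\theta) < \infty\}, \qquad \theta_- = \inf\{\theta : m(\theta) < \infty\}.$$
(Because $m(0) < \infty$, $\theta_+ \in [0, \infty]$ and $\theta_- \in [-\infty, 0]$.) Then on the open interval $(\theta_-, \theta_+)$, $m(\theta)$ is a smooth
function with $k$th derivative $m^{(k)}(\theta) = E[X^k e^{\theta X}] < \infty$. Additionally, if $\theta_+ > 0$, then $m^{(k)}(\theta) \to E[X^k]$ as $\theta \downarrow 0$.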


While we can't generally exchange derivatives and expectation values, this lemma tells us that we can indeed do so
for the moment generating function. We won't do the proof in full here, but to show that $m(\theta)$ is finite on the interval
$(\theta_-, \theta_+)$, notice that for any $a < b < c$, $e^{bx} \le e^{ax} + e^{cx}$, so $E[e^{bX}] \le E[e^{aX}] + E[e^{cX}]$ (meaning that if the moment
generating function is finite at $a$ and $c$, it is also finite in between). Smoothness follows by checking the definition
of the derivative (which also exists on $(\theta_-, \theta_+)$ because $E[e^{\theta X}]$ is a power series in $\theta$), and the last statement follows
from the dominated convergence theorem.


Definition 96 


The cumulant generating function (also cgf) of a random variable is the function $\kappa(\theta) = \log m(\theta)$.


We're now almost ready to answer our question (about calculating $P(S_n \ge na)$ for some $a > \mu$), and we just need
a bit more notation. Let $x_{\max}$ be the essential supremum of the random variable $X$, meaning that
$$x_{\max} = \sup(\operatorname{supp} \mathcal{L}_X) = \sup\{x \in \mathbb{R} : P(X \ge x) > 0\}.$$
The idea is that nothing interesting happens above $x_{\max}$, because for any $a > x_{\max}$, $P(X \ge a) = 0$, so $P(S_n \ge na) = 0$
(each term in the sum $S_n$ is almost surely less than $a$, so their sum must be less than $na$). And if $a = x_{\max}$, then
$P(S_n \ge na) = P(X = a)^n$ (because we must have all terms exactly $a$ to have sum $na$), which can be positive if we have a point of positive
measure at $a$. So we already know the answer in those cases, and from here on we will assume $EX < a < \operatorname{ess\,sup} X$.
(In particular, this also rules out the case where $EX = \operatorname{ess\,sup} X$, in which case the distribution is concentrated entirely
at one value and nothing interesting happens.)


Theorem 97 (Cramér) 
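Let $X, X_i$ be iid random variables. If $m(\theta) = E[e^{\theta X}]$ is finite for some $\theta > 0$, then $E[X] = \mu \in [-\infty, \infty)$ is
well-defined (by Lemma 95). In addition, for any $\mu < a < x_{\max} = \operatorname{ess\,sup} X$, the sum $S_n = \sum_{i=1}^n X_i$ satisfies the
“large deviations principle”
$$-\lim_{n \to \infty} \frac{1}{n} \log P(S_n \ge na) = I(a) = \sup_{\theta \ge 0}\,[\theta a - \kappa(\theta)] = \sup_{\theta \in \mathbb{R}}\,[\theta a - \kappa(\theta)].$$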


To explain the bounds $\mu \in [-\infty, \infty)$, $E[e^{\theta X}]$ being finite for some $\theta > 0$ rules out the case where $X$ is really
big (since $e^{\theta x} \ge x$ for sufficiently large $x$), but it doesn't prove that $X$ cannot be really small (so we could have
$EX_+ - EX_- = -\infty$). The function $\kappa^*(a) = \sup_{\theta \in \mathbb{R}}[\theta a - \kappa(\theta)]$ is often called the Legendre dual of the function $\kappa$.

The actual statement of the large deviations principle is a bit more advanced, but we can read that on our own:
the central idea of our first equality is still that $P(S_n \ge na) = \exp(-nI(a) + o(n))$. Intuitively, this result says that
there may be many ways to have $X_1, \dots, X_n$ add up to an atypically large value $na$, but that it's unlikely
to have a case like $X_1 = \cdots = X_{n-1} = 0$ and $X_n = na$.
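
To make the statement concrete, here is a small Python sketch for the Bernoulli($p$) case, where $\kappa(\theta) = \log(1 - p + pe^\theta)$ and the Legendre dual has the well-known closed form $a \log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}$. The function names, the ternary search, and the search range `theta_max` are choices made for this illustration, not part of the lecture.

```python
from math import ceil, exp, lgamma, log


def kappa(theta: float, p: float) -> float:
    """Cumulant generating function of a Bernoulli(p) random variable."""
    return log(1.0 - p + p * exp(theta))


def rate_numeric(a: float, p: float, theta_max: float = 50.0) -> float:
    """sup over theta in [0, theta_max] of theta*a - kappa(theta), by ternary search.

    Ternary search is valid because the objective is concave in theta.
    """
    lo, hi = 0.0, theta_max
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if m1 * a - kappa(m1, p) < m2 * a - kappa(m2, p):
            lo = m1
        else:
            hi = m2
    theta = (lo + hi) / 2
    return theta * a - kappa(theta, p)


def rate_closed_form(a: float, p: float) -> float:
    """Relative entropy of Bernoulli(a) with respect to Bernoulli(p)."""
    return a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))


def empirical_rate(n: int, a: float, p: float) -> float:
    """-(1/n) log P(Bin(n, p) >= n*a), with the tail summed in log space."""
    first = ceil(n * a - 1e-9)
    logs = [
        lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1) + j * log(p) + (n - j) * log(1 - p)
        for j in range(first, n + 1)
    ]
    top = max(logs)
    log_tail = top + log(sum(exp(v - top) for v in logs))
    return -log_tail / n
```

With $p = 0.3$ and $a = 0.5$, `rate_numeric` and `rate_closed_form` should agree to high precision, while `empirical_rate(n, 0.5, 0.3)` approaches the same value slowly as $n$ grows, reflecting the $o(n)$ correction in the exponent.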


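Proof. We'll first prove an upper bound on $P(S_n \ge na)$. For any $\theta \ge 0$, we have
$$P(S_n \ge na) \le P\left(e^{\theta S_n} \ge e^{n\theta a}\right) \le \frac{E[e^{\theta S_n}]}{e^{n\theta a}} = \frac{m(\theta)^n}{e^{n\theta a}} = \exp\left[-n(\theta a - \kappa(\theta))\right],$$
where we have inequality instead of equality in the first relation only because of $\theta = 0$, and where we apply Markov's
inequality in the second. Since this inequality holds for any $\theta \ge 0$, we thus have
$$P(S_n \ge na) \le \exp\left[-n \sup_{\theta \ge 0}(\theta a - \kappa(\theta))\right].$$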


Next, we'll prove that the last two expressions are equal (so the supremum is the same whether we take $\theta \ge 0$ or
$\theta \in \mathbb{R}$). For any $\theta \in (\theta_-, \theta_+)$, define a probability measure $P_\theta$ which assigns to any event $A$ the probability
$$P_\theta(A) = E\left[\mathbf{1}_A\,\frac{e^{\theta X}}{m(\theta)}\right],$$
where the expectation $E$ is taken relative to the original probability measure. We'll cover this kind of operation more
in a later lecture, but it is basically a reweighting of the original measure $P$ (known as an exponential tilting), and it
is a probability measure because $P_\theta(\Omega) = E\left[\frac{e^{\theta X}}{m(\theta)}\right] = \frac{1}{m(\theta)} E[e^{\theta X}] = 1$. Thus for any measurable function $f$,


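$$E_\theta[f(X)] = E\left[f(X)\,\frac{e^{\theta X}}{m(\theta)}\right]$$
(because this is true for simple functions by the definition and then we can approximate measurable functions by simple
functions). This also means (replacing $f(X)$ with $f(X)\,m(\theta)e^{-\theta X}$) that $E_\theta\left[f(X)\,m(\theta)e^{-\theta X}\right] = E[f(X)]$, and in particular
$$\kappa(\theta) = \log E[e^{\theta X}] \implies \kappa'(\theta) = \frac{m'(\theta)}{m(\theta)} = \frac{E[Xe^{\theta X}]}{E[e^{\theta X}]} = E\left[X\,\frac{e^{\theta X}}{m(\theta)}\right] = E_\theta[X],$$
where we've applied Lemma 95 to $m'(\theta)$. We can also calculate the second derivative of $\kappa$ to be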


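$$\kappa''(\theta) = \frac{m''(\theta)}{m(\theta)} - \left(\frac{m'(\theta)}{m(\theta)}\right)^2 = \frac{E[X^2 e^{\theta X}]}{E[e^{\theta X}]} - \left(\frac{E[Xe^{\theta X}]}{E[e^{\theta X}]}\right)^2 = E_\theta[X^2] - (E_\theta X)^2 = \operatorname{Var}_\theta(X),$$
which is strictly positive for any $\theta$ because we assumed that $\mu < \operatorname{ess\,sup} X$ and thus $X$ is not concentrated at a single
point. In other words, $\kappa(\theta)$ is a strictly convex function on $(\theta_-, \theta_+)$. And because $\kappa'(\theta) \to \mu$ as $\theta \downarrow 0$ (again by
Lemma 95), we may pick some sufficiently small $\theta = \theta_0 > 0$ such that $\mu < \kappa'(\theta_0) < a$. We then find that
$$\frac{d}{d\theta}\left(\theta a - \kappa(\theta)\right)\Big|_{\theta = \theta_0} = a - \kappa'(\theta_0) > 0.$$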


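Additionally, $\theta a - \kappa(\theta)$ is strictly concave on $(\theta_-, \theta_+)$ (because $\theta a$ has zero second derivative, and then we subtract
$\kappa(\theta)$), so it is increasing on $(\theta_-, \theta_0]$. In particular, no negative value of $\theta$ does better than $\theta = 0$, which implies that we indeed have
$$\sup_{\theta \ge 0}(\theta a - \kappa(\theta)) = \sup_{\theta \in \mathbb{R}}(\theta a - \kappa(\theta)).$$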


Finally, we can prove the lower bound. We'll only do it in a nice case, assuming that the supremum $\sup_{\theta \in \mathbb{R}}(\theta a - \kappa(\theta))$
is actually attained at some value $\theta_a \in (\theta_-, \theta_+)$. This is an assumption which doesn't always hold, and the theorem
is true without making it, but this will help us avoid some infinite behavior. We have just shown that $\theta_a > 0$, so pick
some $\theta \in (\theta_a, \theta_+)$. Since $\theta_a$ is a critical point of $\theta a - \kappa(\theta)$, strict convexity of $\kappa$ tells us that
$$\kappa'(\theta) > \kappa'(\theta_a) = a.$$


We'll now look at the change of measure applied to the $X_i$s, defining
$$P_\theta\left(\bigcap_{i=1}^n \{X_i \in A_i\}\right) = E\left[\prod_{i=1}^n \frac{e^{\theta X_i}}{m(\theta)}\,\mathbf{1}\{X_i \in A_i\}\right]$$
for any events $A_i$. (In particular, the $X_i$s are also iid under $P_\theta$ because of the product form of this equation.) We may
then extend this to measurable functions in the same way as above, and in particular we can write (notice that we are
now changing measure on the right-hand side instead of the left)
$$P(S_n \ge na) = E_\theta\left[\frac{m(\theta)^n}{e^{\theta S_n}} \cdot \mathbf{1}\{S_n \ge na\}\right].$$
We are trying to prove a lower bound, so it is okay to decrease the right-hand side: picking some $b > \kappa'(\theta)$, it is also
true that
$$P(S_n \ge na) \ge E_\theta\left[\frac{m(\theta)^n}{e^{\theta S_n}} \cdot \mathbf{1}\{S_n \in [na, nb]\}\right] \ge E_\theta\left[\frac{m(\theta)^n}{e^{\theta nb}} \cdot \mathbf{1}\{S_n \in [na, nb]\}\right],$$


because $S_n \le nb$ within this event $\{S_n \in [na, nb]\}$. But here's the magic: under our changed measure, $E_\theta[X] =
\kappa'(\theta) \in (a, b)$, so $P_\theta(S_n \in [na, nb]) \to 1$ as $n \to \infty$ by the law of large numbers. In other words, doing our change of
measure has turned a rare event into a typical event! Thus, we indeed have
$$P(S_n \ge na) \ge \frac{m(\theta)^n}{e^{\theta nb}}\,(1 - o(1)) = (1 - o(1)) \exp\left[-n(\theta b - \kappa(\theta))\right]$$
for any $\theta > \theta_a$ and $b > \kappa'(\theta)$. Taking logarithms, dividing by $n$, and sending $n \to \infty$, and then taking $\theta \downarrow \theta_a$ and
$b \downarrow \kappa'(\theta_a) = a$, gives us the desired lower
bound (because $\theta_a$ is where the supremum is achieved).


Something magical is happening here: the convenient exponentiation in the upper bound when we used Markov's
inequality was a bit arbitrary, and we also performed a pretty arbitrary-looking change of measure to get the lower
bound, but they turn out to yield the same Legendre dual. So in some sense, exponentiating is optimal on both sides
of the inequality, and we now have another way to understand the theorem statement: “the most efficient (probable)
way to achieve a large deviation $S_n \ge na$ is to have $X_1, \dots, X_n$ all behave like a collection of samples from our tilted
measure $P_\theta$, choosing $\theta$ such that $E_\theta[X] = a$.”
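
The change-of-measure identity $P(S_n \ge na) = E_\theta\left[m(\theta)^n e^{-\theta S_n} \mathbf{1}\{S_n \ge na\}\right]$ is also the basis of a standard simulation technique (importance sampling), which is not discussed in the lecture but illustrates how tilting makes the rare event typical. The Python sketch below does this for Bernoulli($p$) variables, where the tilted law with $E_\theta[X] = a$ is simply Bernoulli($a$); the function names, sample size, and seed are arbitrary choices for this example.

```python
import random
from math import ceil, comb, exp, log


def exact_tail(n: int, p: float, a: float) -> float:
    """P(Bin(n, p) >= n*a), summed directly (fine for moderate n)."""
    first = ceil(n * a - 1e-9)
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(first, n + 1))


def tilted_estimate(n: int, p: float, a: float, samples: int = 20000, seed: int = 0) -> float:
    """Estimate P(S_n >= n*a) by sampling under the tilted measure P_theta.

    For Bernoulli(p), theta = log(a(1-p) / (p(1-a))) solves kappa'(theta) = a,
    and under P_theta the X_i are iid Bernoulli(a).
    """
    theta = log(a * (1 - p) / (p * (1 - a)))
    kappa = log(1 - p + p * exp(theta))  # log m(theta)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        s = sum(rng.random() < a for _ in range(n))  # S_n sampled under P_theta
        if s >= n * a:
            # Weight m(theta)^n * exp(-theta * S_n) converts back to the original measure.
            total += exp(n * kappa - theta * s)
    return total / samples
```

Under the original measure the event $\{S_{100} \ge 50\}$ with $p = 0.3$ is far too rare to estimate reliably from 20000 plain samples, whereas under the tilted measure roughly half of the samples land in it, so `tilted_estimate(100, 0.3, 0.5)` should land close to `exact_tail(100, 0.3, 0.5)`.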


Example 98 
In some special cases, we can calculate these large deviations more explicitly, and we'll finish this lecture by doing
so in the case where $\mathcal{L}_X$ has finite support of size $k$. Suppose we have a random variable $X$ which can only take
on the values $x_1, \dots, x_k$, with probabilities $\pi_1, \dots, \pi_k$ respectively.


Definition 99 


The empirical measure of the random variables $X_1, \dots, X_n$ (in general, not just in the case of finite support) is
$$L_n^X = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}.$$

(In other words, we sample the $X_i$s and put a weight of $\frac{1}{n}$ at each achieved value.) $L_n^X$ is a probability measure
on $\mathbb{R}$, and it is a random variable that is $\sigma(X_1, \dots, X_n)$-measurable. In our particular case where the support of $X$
is finite, we can express the empirical measure as a vector $L_n^X \in [0, 1]^k$, with $j$th component equal to the empirical
fraction
$$(L_n^X)_j = \frac{1}{n}\,\#\{1 \le i \le n : X_i = x_j\}.$$


Since there are only finitely many possibilities for $L_n^X$, we can explicitly calculate any particular probability
$$P(L_n^X = \nu) = \frac{n!}{\prod_{j=1}^k (n\nu_j)!}\,\prod_{j=1}^k \pi_j^{n\nu_j}$$
(this is the probability that a $\nu_j$ fraction of the random variables land on $x_j$ for each $j$). Expanding the factorials with Stirling's
formula and simplifying, we find that


$$P(L_n^X = \nu) = \exp\left[-nH(\nu|\pi) + o(n)\right],$$
where
$$H(\nu|\pi) = \sum_{j=1}^k \nu_j \log\frac{\nu_j}{\pi_j} = D_{\mathrm{KL}}(\nu|\pi)$$
is the relative entropy or Kullback-Leibler divergence between the two measures (though note that this expression
is not symmetric between $\nu$ and $\pi$). This is just a combinatorial calculation, but we can now do something with
it. Suppose the support of $\mathcal{L}_X$ contains all $k$ values $x_1, \dots, x_k$ (that is, every $\pi_j > 0$), so that $H(\nu|\pi)$ is strictly convex in $\nu$. Notice that $S_n$
doesn't depend on the ordering of the $X_i$s, and in fact it is a function of $L_n^X$ because
$$S_n = n\sum_{j=1}^k x_j\,(L_n^X)_j = n\langle x, L_n^X \rangle.$$
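
The Stirling estimate $P(L_n^X = \nu) = \exp[-nH(\nu|\pi) + o(n)]$ can be checked numerically. In this Python sketch (function names are hypothetical), the exact multinomial probability is evaluated in log space using `math.lgamma` and compared with $-nH(\nu|\pi)$.

```python
from math import lgamma, log


def relative_entropy(nu, pi):
    """H(nu | pi) = sum_j nu_j log(nu_j / pi_j), with the convention 0 log 0 = 0."""
    return sum(v * log(v / q) for v, q in zip(nu, pi) if v > 0)


def log_prob_empirical(counts, pi):
    """log P(L_n = counts / n) for n iid draws from pi (the multinomial formula)."""
    n = sum(counts)
    out = lgamma(n + 1)  # log n!
    for c, q in zip(counts, pi):
        out += c * log(q) - lgamma(c + 1)  # c log(pi_j) - log(c!)
    return out


def exponent_gap(counts, pi):
    """-(1/n) log P(L_n = nu) - H(nu | pi); should shrink to 0 as n grows."""
    n = sum(counts)
    nu = [c / n for c in counts]
    return -log_prob_empirical(counts, pi) / n - relative_entropy(nu, pi)
```

For instance, with $\pi = (0.5, 0.3, 0.2)$ and $\nu = (0.2, 0.3, 0.5)$, the gap for counts $(20, 30, 50)$ is noticeably larger than for $(2000, 3000, 5000)$, consistent with a correction of order $\frac{\log n}{n}$.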


For any $a$ such that $E[X] < a < \operatorname{ess\,sup} X$, we thus have (summing over all values $\nu$ that $L_n^X$ can take)
$$P(S_n \ge na) = \sum_\nu P(L_n^X = \nu) \cdot \mathbf{1}\{n\langle x, \nu \rangle \ge na\} = \sum_\nu \mathbf{1}\{\langle x, \nu \rangle \ge a\} \exp\left(-nH(\nu|\pi) + o(n)\right).$$
But the number of values $\nu$ is polynomial in $n$ (by stars and bars), which can be absorbed into the $o(n)$ term in the
exponential. In other words, we find that (taking the largest exponential across all $\nu$s)
$$P(S_n \ge na) = \exp\left[-n \inf\{H(\nu|\pi) : \langle x, \nu \rangle \ge a\} + o(n)\right].$$


To find the optimal $\nu$, we now have a constrained optimization problem with Lagrangian (remembering that we have
the constraint that our probabilities need to sum to 1, and writing $\mu$ here for a Lagrange multiplier rather than the mean)
$$L = H(\nu|\pi) + \mu\left(1 - \sum_j \nu_j\right) + \theta\left(a - \sum_j x_j \nu_j\right),$$
and taking its derivative gives us the solution
$$\frac{\partial L}{\partial \nu_j} = 1 + \log \nu_j - \log \pi_j - \mu - \theta x_j = 0 \implies \nu_j = \frac{\pi_j e^{\theta x_j}}{C},$$
where we should choose the normalization constant to be $C = m(\theta)$ so probabilities do add to one. But this is the
same change of measure $P_\theta$ as we talked about in the proof of Theorem 97! (This time, it comes from the final
optimization problem rather than being pulled out of nowhere.) In other words, when we are trying to achieve this
large deviation event $S_n \ge na$, the primary contribution (the slowest decaying exponential) comes from $L_n^X \approx \nu = P_\theta$.


And as an exercise, we can also show that
$$P(S_n \ge na) = \exp\left[-n \inf_{\theta : E_\theta[X] \ge a} H(P_\theta|\pi) + o(n)\right],$$
with optimal $\theta$ being achieved at equality $E_\theta[X] = a$.
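
The link between the two descriptions of the rate can be checked numerically: for the tilted measure $\nu_j = \pi_j e^{\theta x_j} / m(\theta)$ with $E_\theta[X] = a$, a direct computation gives $H(\nu|\pi) = \theta a - \kappa(\theta)$. The Python sketch below (hypothetical function names; the bisection bracket is an assumption that works when $E[X] < a < \max_j x_j$) verifies this on a three-point distribution.

```python
from math import exp, log


def tilt(pi, xs, theta):
    """Return the tilted probabilities pi_j * exp(theta * x_j) / m(theta), and kappa(theta)."""
    weights = [q * exp(theta * x) for q, x in zip(pi, xs)]
    m = sum(weights)  # m(theta)
    return [w / m for w in weights], log(m)


def tilted_mean(pi, xs, theta):
    nu, _ = tilt(pi, xs, theta)
    return sum(v * x for v, x in zip(nu, xs))


def solve_tilt(pi, xs, a):
    """Bisection for theta >= 0 with E_theta[X] = a (assumes E[X] < a < max(xs))."""
    lo, hi = 0.0, 1.0
    while tilted_mean(pi, xs, hi) < a:  # grow the bracket until it contains the root
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if tilted_mean(pi, xs, mid) < a:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def relative_entropy(nu, pi):
    """H(nu | pi), with the convention 0 log 0 = 0."""
    return sum(v * log(v / q) for v, q in zip(nu, pi) if v > 0)
```

Bisection is justified here because $E_\theta[X] = \kappa'(\theta)$ is increasing in $\theta$ (its derivative is $\operatorname{Var}_\theta(X) > 0$).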


12 October 16, 2019 


(Our first midterm is on Monday, so we should make sure to remember to show up to class.) Today, we'll cover some 
loose ends that have been mentioned but not in detail, and we'll also introduce the Fourier transform (which is a topic 
for after the exam). Part of this is fair game for the exam, but this will be communicated clearly. We'll start with a 


sample problem for the midterm:

Problem 100

Prove that the law of a real-valued random variable $X$ is uniquely determined by its cumulative distribution function
$F(x) = P(X \le x)$.

This is the level of formality we should expect: we should understand, for example, that $X$ is a random variable on
some probability space $(\Omega, \mathcal{F}, P)$, and its law is defined as $\mathcal{L}_X = X_\# P = \mu$, which is a probability measure on $(\mathbb{R}, \mathcal{B}_\mathbb{R})$.


Solution. The cdf $F(x) = P(X \le x)$ is also the measure $\mu((-\infty, x])$, so knowing $\mu$ also determines $F$. For the
converse, suppose there are two measures $\mu, \nu$ on $\mathbb{R}$ that both correspond to the same cumulative distribution function,
so $F(x) = \mu((-\infty, x]) = \nu((-\infty, x])$ for all $x$. This means that
$$\mu((a, b]) = F(b) - F(a) = \nu((a, b])$$
for all half-open intervals. Let $\mathcal{F}$ be the set of Borel sets $B$ such that $\mu(B) = \nu(B)$; we need to show that $\mathcal{F} = \mathcal{B}_\mathbb{R}$,
which we can do with a $\pi$-$\lambda$ argument. Specifically, $\mathcal{F}$ contains the set of half-open intervals, which is a $\pi$-system,
and we can check that $\mathcal{F}$ is a $\lambda$-system as well (by properties of a measure). So by the $\pi$-$\lambda$ theorem, $\mathcal{F}$ contains the
$\sigma$-algebra generated by the set of half-open intervals, which is $\mathcal{B}_\mathbb{R}$. Therefore $\mu = \nu$ and the measure $\mu$ is uniquely
determined by $F$ as well.


Remark 101. We won't be allowed to bring cheat sheets, but some useful things will be written on the cover page. 
(However, it is reasonable for us to remember, for instance, the full weak law of large numbers.) If a result follows 
directly from a theorem in class, we can cite the theorem and just check that all the conditions hold. But if we're 


instead being told to reproduce a very similar proof, that will be written out. 


For today’s class, we'll start by discussing a probabilistic concept which we haven't talked about too formally: 


Definition 102 


Let $X$ and $Y$ be independent real-valued random variables with laws $\mathcal{L}_X = \mu$ and $\mathcal{L}_Y = \nu$, so that $\mathcal{L}_{(X,Y)} = \mu \otimes \nu$.
Let their cdfs be $F(x) = P(X \le x)$ and $G(y) = P(Y \le y)$. Because $\mu$ and $\nu$ are in one-to-one correspondence
with $F$ and $G$ respectively (by Problem 100), we may define the convolution $\mu * \nu = \nu * \mu$ to be the law of
$Z = X + Y$. We denote the corresponding cdf by $F * G = G * F$.


Note that $\mu, \nu$ can be discrete, continuous, or anything in between, and this definition is still valid. To get a more
explicit expression for $F * G$, notice that
$$(F * G)(z) = P(X + Y \le z) = \iint \mathbf{1}\{x + y \le z\}\,d\mu(x)\,d\nu(y),$$
where we've implicitly used the change of variables formula to write the integral in terms of $\mu$ and $\nu$ and Tonelli's
theorem to break up the double integral. Then the inner integral is just the probability that $X \le z - y$, so we have
$$(F * G)(z) = \int F(z - y)\,d\nu(y) = \int F(z - y)\,dG(y),$$


where the last equality is just saying that $\nu$ and $G$ carry the same information. (And by symmetry, this is also equal
to $\int G(z - x)\,dF(x)$.) We can say something even nicer in the case where $\mu$ and $\nu$ have densities: in particular,
$\mu * \nu$ also has a density. To see this, suppose $\mu$ and $\nu$ correspond to the density functions $f$ and $g$. By definition, that
means that for all Borel sets $B$, $\mu(B) = \int \mathbf{1}_B(x) f(x)\,dx$ (where $dx$ is the standard Lebesgue measure), where $f$ is a
nonnegative measurable function on $(\mathbb{R}, \mathcal{B}_\mathbb{R})$. In particular, $F(x) = \int_{-\infty}^x f(t)\,dt$ (by plugging in the set $(-\infty, x]$), and
for any integrable random variable $h(X)$, we have $E[h(X)] = \int_{-\infty}^\infty h(t) f(t)\,dt$. So plugging in $dG(y) = g(y)\,dy$, we
find that
$$(F * G)(z) = \int_{-\infty}^\infty F(z - y)\,dG(y) = \int_{-\infty}^\infty F(z - y)\,g(y)\,dy = \int_{-\infty}^\infty \int_{-\infty}^z f(t - y)\,dt\;g(y)\,dy = \int_{-\infty}^z \int_{-\infty}^\infty f(t - y)\,g(y)\,dy\,dt$$
by Tonelli's theorem. So we've now written the cdf $F * G$ as an integral $\int_{-\infty}^z (f * g)(t)\,dt$, and therefore $\mu * \nu$ has
density
$$(f * g)(z) = \int f(z - y)\,g(y)\,dy = \int g(z - x)\,f(x)\,dx.$$
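
As a quick illustration of the density formula, the following Python sketch approximates $(f * g)(z) = \int f(z - y) g(y)\,dy$ with a midpoint rule. The function names, the integration window $[-5, 5]$, and the step count are assumptions for this example; the test case is the standard fact that the sum of two independent Uniform$[0,1]$ variables has the triangular density $z$ on $[0,1]$ and $2 - z$ on $[1,2]$.

```python
def convolve_density(f, g, z, lo=-5.0, hi=5.0, steps=20000):
    """Midpoint-rule approximation of (f*g)(z) = integral of f(z - y) g(y) dy over [lo, hi].

    Assumes g is negligible outside [lo, hi].
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += f(z - y) * g(y)
    return total * h


def uniform01(x):
    """Density of the Uniform[0, 1] distribution."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0
```

Evaluating `convolve_density(uniform01, uniform01, z)` at a few points such as $z = 0.5, 1, 1.5$ should reproduce the triangular shape up to discretization error.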


We'll now move on to our next topic, the Fourier transform. (This topic won't be on the first exam, and not 


everything mentioned today is necessary for us to know, but it may be useful intuition.) 


Definition 103 
The Fourier transform or characteristic function of a probability measure $\mu$ on $(\mathbb{R}, \mathcal{B}_\mathbb{R})$ is a function $\varphi_\mu = \hat\mu :
\mathbb{R} \to \mathbb{C}$ defined by
$$\varphi_\mu(t) = \int e^{itx}\,d\mu(x).$$
In particular, if $X$ is a random variable with law $\mu$, its characteristic function is $\varphi_X(t) = E[e^{itX}]$.


Because $X$ is real-valued, $e^{itX}$ is always on the unit circle in $\mathbb{C}$, so $\varphi_\mu(t)$ is inside the unit ball in $\mathbb{C}$. Thus, $\varphi_\mu(t)$ is
well-defined for all $t \in \mathbb{R}$, and it's continuous in $t$ (by the bounded convergence theorem). The reason this definition
is useful (and in particular good for proving something like the Central Limit Theorem) is that the Fourier transform
turns convolution into multiplication. Specifically, if $X$ is a random variable with law $\mu$ and $Y$ is a random variable
with law $\nu$, and the two random variables are independent, then we have the useful identity
$$\varphi_{\mu * \nu}(t) = E\left[e^{it(X + Y)}\right] = E\left[e^{itX}\right] E\left[e^{itY}\right] = \varphi_\mu(t)\,\varphi_\nu(t).$$


Example 104 


We will indeed eventually study the Fourier transform on $\mathbb{R}$, and we'll see that it captures a lot of properties of
the measure $\mu$. But today, we'll look at the Fourier transform on a finite space.


We'll look at the space $\Omega = \mathbb{Z}/n\mathbb{Z}$ (the integers mod $n$), which will keep our story simple. Let $V$ be the set of
functions $f : \Omega \to \mathbb{C}$, which is a finite-dimensional complex vector space isomorphic to $\mathbb{C}^{|\Omega|}$. If we view these functions
as vectors, then $V$ has a Hermitian inner product given by
$$\langle f, g \rangle = \sum_{x=0}^{n-1} f(x)\,\overline{g(x)} = g^* f,$$
where $g^*$ is the conjugate transpose of $g$. Then for any (nonzero) $z \in \Omega$, we can consider the translation operator $T_z$
acting on functions via
$$(T_z f)(x) = f(x - z).$$


$T_z$ is then essentially sending vectors to vectors, so we can think of it as a matrix: specifically, it is a circulant matrix
with shift depending on $z$. We can then calculate its eigenvectors and eigenvalues: $f$ is an eigenvector of $T_z$ with
eigenvalue $\lambda$ if and only if for all $x \in \Omega$,
$$\lambda f(x) = T_z f(x) = f(x - z) \implies \lambda^k f(x) = f(x - kz) = f(x) \text{ if } kz \equiv 0 \bmod n.$$


In particular (plugging in $k = n$), this means that $\lambda^n = 1$, so any eigenvalue $\lambda$ must be an $n$th root of unity. Now if
$n$ is prime, then $T_z$ has $n$ distinct eigenvalues, which are exactly those $n$th roots of unity, and we can check that the
eigenvectors are of the form
$$\chi_\ell(x) = \frac{1}{\sqrt{n}} \exp\left(\frac{2\pi i \ell x}{n}\right)$$
with corresponding eigenvalues $\lambda_{z,\ell} = \exp\left(-\frac{2\pi i z\ell}{n}\right)$. On the other hand, if $n$ is not prime, we no longer have a unique
eigenbasis. Luckily, it turns out that $\chi_0, \dots, \chi_{n-1}$ still do form an eigenbasis of $T_z$ for all $z \in \Omega$, and $\chi_\ell$ is still an
eigenvector of $T_z$ with that same eigenvalue $\lambda_{z,\ell} = \exp\left(-\frac{2\pi i z\ell}{n}\right)$. In other words, because our finite space has a
periodic structure, the translation operators have a particularly nice form. We can also find that the scalar product
between two vectors in our eigenbasis is


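$$\langle \chi_k, \chi_\ell \rangle = (\chi_\ell)^* \chi_k = \sum_{x} \chi_k(x)\,\overline{\chi_\ell(x)} = \frac{1}{n} \sum_{x=0}^{n-1} \exp\left(\frac{2\pi i (k - \ell) x}{n}\right).$$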


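This is 1 when $k = \ell$ and 0 otherwise, which means that with respect to the Hermitian inner product, the $\chi_\ell$s form an
orthonormal basis. Thus, the matrix
$$U = \begin{bmatrix} \chi_0 & \chi_1 & \cdots & \chi_{n-1} \end{bmatrix}$$
(whose columns are the vectors $\chi_\ell$) is unitary (meaning that $U^*U = UU^* = I$). We can call $(\chi_0, \dots, \chi_{n-1})$ the Fourier basis for $V$, and the Fourier
transform is just the change-of-basis operation sending a function $f$ to $\hat f = U^* f$ (which basically tells us the coordinates
of $f$ in the Fourier basis). In other words,
$$f = UU^* f = U\hat f = \sum_\ell \hat f(\ell)\,\chi_\ell$$
is how we write $f$ as a linear combination of the eigenvectors $\chi_\ell$, and the coefficients $\hat f(\ell)$ are given explicitly by
$$\hat f(\ell) = (U^* f)_\ell = \langle f, \chi_\ell \rangle = \sum_x f(x)\,\overline{\chi_\ell(x)} = \frac{1}{\sqrt{n}} \sum_x f(x) \exp\left(-\frac{2\pi i \ell x}{n}\right).$$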


Ignoring the constants, this expression should look very similar to how we multiply f by some complex exponential in 
the regular Fourier transform in Definition 103. So the whole point is that the Fourier transform is a change-of-basis 
operation which diagonalizes the translation operators, and the exponentials in the definition are motivated by the
discrete case in which they pop out of the eigenvalue calculation. 

But we can study convolution in this finite space as well, because we can think of the expression for $f * g$ in terms
of translations of one of the two functions:
$$(f * g)(x) = \sum_z g(z)\,f(x - z) = \sum_z g(z)\,(T_z f)(x).$$
Thus, we can think of $f * g$ as applying an operator $C_g$, which is a linear combination of translation operators, to $f$:


$$C_g = \sum_z g(z)\,T_z.$$
But since the Fourier basis diagonalizes all of the $T_z$s simultaneously, we should expect that the Fourier basis is also
good for $C_g$. Specifically, we know that $T_z = U\Lambda_z U^*$ (where $\Lambda_z$ is the diagonal matrix with entries $\lambda_{z,0}, \dots, \lambda_{z,n-1}$)
for each $z$, or equivalently (expanding out the matrix multiplication)
$$T_z = \sum_{\ell=0}^{n-1} \lambda_{z,\ell}\,\chi_\ell \chi_\ell^*.$$
Plugging this into our expression for $C_g$, we thus find that
$$C_g = \sum_z g(z) \sum_\ell \lambda_{z,\ell}\,\chi_\ell \chi_\ell^* = \sum_\ell \left(\sum_z g(z)\,\lambda_{z,\ell}\right) \chi_\ell \chi_\ell^*.$$
Zz £ £ Zz 


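Now because $\lambda_{z,\ell} = \exp\left(-\frac{2\pi i z\ell}{n}\right)$ is (up to the normalizing factor $\frac{1}{\sqrt{n}}$, which we ignore here along with the other constants) the $z$th entry in the $\ell$th row of the matrix $U^*$, the inner sum can also be thought
of as the $\ell$th coordinate of the vector $U^*g$. So we can write $C_g$ in matrix form as the product of three terms,
$$C_g = \begin{bmatrix} \chi_0 & \cdots & \chi_{n-1} \end{bmatrix} \begin{bmatrix} (U^*g)_0 & & \\ & \ddots & \\ & & (U^*g)_{n-1} \end{bmatrix} \begin{bmatrix} \chi_0^* \\ \vdots \\ \chi_{n-1}^* \end{bmatrix} = U \operatorname{diag}(U^*g)\,U^*.$$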


But this now allows us to understand how convolution and the Fourier transform interact: we have
$$\widehat{f * g} = \widehat{C_g f} = U^* C_g f = U^* U \operatorname{diag}(U^*g)\,U^* f = \operatorname{diag}(U^*g)\,U^* f.$$
But now $U^*g$ is the Fourier transform of $g$, and $U^*f$ is the Fourier transform of $f$. Thus this last expression is the
entry-wise product of $\hat f$ and $\hat g$, and we do indeed see that $\widehat{f * g}(\ell) = \hat f(\ell)\,\hat g(\ell)$, analogous to the identity $\varphi_{\mu * \nu}(t) =
\varphi_\mu(t)\,\varphi_\nu(t)$ that we derived earlier! So the Fourier transform can also be thought of as the unique object behaving in
this particularly nice way under convolution, and as we mentioned before, it will play a role in some of the results we
will prove later in this class.
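
The finite Fourier picture is easy to check by direct computation. The Python sketch below (function names are hypothetical) builds the basis vectors $\chi_\ell(x) = \frac{1}{\sqrt n} e^{2\pi i \ell x / n}$, the transform $\hat f = U^* f$, and the cyclic convolution. One caveat worth making explicit: with the $\frac{1}{\sqrt n}$ normalization, a direct calculation gives $\widehat{f * g}(\ell) = \sqrt{n}\,\hat f(\ell)\,\hat g(\ell)$, so the product rule holds up to the constant $\sqrt n$.

```python
import cmath
from math import sqrt


def fourier_basis(n):
    """Matrix U with entries U[x][l] = chi_l(x) = exp(2*pi*i*l*x/n) / sqrt(n)."""
    return [[cmath.exp(2j * cmath.pi * l * x / n) / sqrt(n) for l in range(n)] for x in range(n)]


def transform(f):
    """hat f = U^* f, that is, hat f(l) = sum_x f(x) * conj(chi_l(x))."""
    n = len(f)
    U = fourier_basis(n)
    return [sum(f[x] * U[x][l].conjugate() for x in range(n)) for l in range(n)]


def convolve(f, g):
    """Cyclic convolution (f * g)(x) = sum_z g(z) f(x - z), with indices mod n."""
    n = len(f)
    return [sum(g[z] * f[(x - z) % n] for z in range(n)) for x in range(n)]


def translate(f, z):
    """(T_z f)(x) = f(x - z), with indices mod n."""
    n = len(f)
    return [f[(x - z) % n] for x in range(n)]
```

Applying `translate` to a column of `fourier_basis(n)` rescales it by $\exp(-2\pi i z\ell / n)$, and this works equally well for composite $n$ such as $n = 6$.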


We'll finish this lecture with one more sample problem: 


Problem 105 
Suppose $X_1, X_2, \dots$ are random variables that are Cauchy in probability, meaning that for all $\varepsilon > 0$,
$P(|X_m - X_n| > \varepsilon)$ goes to 0 as $m, n \to \infty$. Prove there exists a random variable $X$ such that $X_n \to X$ in
probability.


Solution. We proved in our homework that if a sequence of random variables converges in probability, it also converges
almost surely along some subsequence. Motivated by this, we'll start by constructing that subsequence. Because the
$X_i$s are Cauchy in probability, there is some sequence $n_k$ (which we can choose to go to infinity) such that for all
$m, n \ge n_k$,
$$P\left(|X_m - X_n| \ge \frac{1}{2^k}\right) \le \frac{1}{2^k}.$$
Now define the event $A_k = \{\omega : |X_{n_{k+1}}(\omega) - X_{n_k}(\omega)| \ge \frac{1}{2^k}\}$. By Borel-Cantelli, the probability that infinitely many of
$A_1, A_2, \dots$ occur is 0, so on $\Omega \setminus \{A_k \text{ i.o.}\}$ (an event of probability 1), these events eventually stop occurring and thus
the $X_{n_k}$ converge (because their values are Cauchy) to some $X$. In other words, the subsequence $X_{n_k}$ converges to


some random variable $X$ almost surely.

It remains to show that the full sequence converges to $X$. By the triangle inequality and a union bound,
$$P(|X_n - X| \ge \varepsilon) \le P\left(|X_n - X_{n_k}| + |X_{n_k} - X| \ge \varepsilon\right) \le P\left(|X_n - X_{n_k}| \ge \frac{\varepsilon}{2}\right) + P\left(|X_{n_k} - X| \ge \frac{\varepsilon}{2}\right).$$
Taking $k \to \infty$, the second term goes to 0 because $X_{n_k}$ converges to $X$ almost surely. And if we also send $n \to \infty$
(so that $n$ and $n_k$ both go to infinity), the first term also goes to zero because the $X_i$s are Cauchy in probability. Thus
$P(|X_n - X| \ge \varepsilon) \to 0$ as $n \to \infty$, which is the desired result.


In preparation for the midterm, Professor Sun's office hours will be Friday 11-12, and the TAs will have office hours 
as well. 




13 October 23, 2019


Last time, we gave some motivation for the Fourier transform of a random variable by considering the discrete
case $\Omega = \mathbb{Z}/n\mathbb{Z}$. Letting $V$ be the set of functions $\Omega \to \mathbb{C}$, we have a finite-dimensional vector space isomorphic
to $\mathbb{C}^n$ with Hermitian inner product $\langle f, g \rangle = g^* f = \sum_{x \in \Omega} f(x)\,\overline{g(x)}$. Under this inner product, we found
an orthonormal basis $(\chi_0, \dots, \chi_{n-1})$ for $V$, with $\chi_\ell(x) = \frac{1}{\sqrt{n}} \exp\left(\frac{2\pi i \ell x}{n}\right)$. In particular, we found that the matrix
$U = \begin{bmatrix} \chi_0 & \chi_1 & \cdots & \chi_{n-1} \end{bmatrix} \in \mathbb{C}^{n \times n}$ is unitary and gives rise to the Fourier transform as a change-of-basis operation
$$\hat f = U^* f.$$
In other words, the coefficient $\hat f(\ell)$ of $f$ in the Fourier basis is $\langle f, \chi_\ell \rangle$. It was particularly nice that Fourier inversion
could be easily performed (since $f = U\hat f$) and that $\langle f, g \rangle = \langle \hat f, \hat g \rangle$ because $U$ is unitary. Today's class will now cover
the Fourier transform for functions on $\mathbb{R}$, and we'll see some of these properties come up again.

We'll start by fixing the normalization. Recall that if $\mu$ is a probability measure on $(\mathbb{R}, \mathcal{B}_\mathbb{R})$, we defined the Fourier
transform or characteristic function of $\mu$ as
$$\varphi_\mu(t) = \int e^{itx}\,d\mu(x) = E\left[e^{itX}\right]$$
for a random variable $X \sim \mu$. We discussed some basic properties of this function last time, and we'll formalize them
now:


Proposition 106 


For any probability measure $\mu$ on $\mathbb{R}$, $\varphi_\mu$ is a uniformly continuous mapping from $\mathbb{R}$ to the closed unit disk
$D = \{z \in \mathbb{C} : |z| \le 1\}$.


Proof. As mentioned previously, $\varphi_\mu$ indeed takes values in $D$ because $e^{itx}$ is always on the unit circle in $\mathbb{C}$. To prove
uniform continuity, we use that
$$|\varphi_\mu(t + h) - \varphi_\mu(t)| = \left|E\left[e^{i(t+h)X} - e^{itX}\right]\right|.$$
Applying Jensen's inequality in the form $|E[Y]| \le E[|Y|]$ allows us to bring the absolute value into the expectation and
find that this is
$$\le E\left[\left|e^{itX}\left(e^{ihX} - 1\right)\right|\right] = E\left[\left|e^{ihX} - 1\right|\right].$$
But the right-hand side does not depend on the point $t \in \mathbb{R}$ we choose, and it tends to zero as $h \to 0$ by applying the
bounded convergence theorem to any sequence $h_n \to 0$ (and remembering that the integral over a probability space is
indeed over a space of finite measure).


In the special case where $\mu$ has a density function $f(x)$, we can write the Fourier transform as an integral
$$\varphi_\mu(t) = \int e^{itx} f(x)\,dx = \hat f(t),$$
and we call this the $L^1$ Fourier transform (because $f$ integrates to 1, so it's in $L^1$). There is an important result from
Fourier analysis that we should keep in mind:


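Theorem 107 (Plancherel/Parseval)
Suppose we have two functions $f, h \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$. Then $\hat f, \hat h \in L^\infty(\mathbb{R}) \cap L^2(\mathbb{R})$, and we have
$$\langle \hat f, \hat h \rangle = 2\pi \langle f, h \rangle.$$
Thus, the mapping $U : L^1(\mathbb{R}) \cap L^2(\mathbb{R}) \to L^\infty(\mathbb{R}) \cap L^2(\mathbb{R})$ defined by $U(f) = \frac{\hat f}{\sqrt{2\pi}}$ has a unique continuous extension
to a unitary map $L^2(\mathbb{R}) \to L^2(\mathbb{R})$.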


The idea is that it is nice to be able to have $U$ map a space to itself, but the integral $\int e^{itx} f(x)\,dx$ may not be
defined if $f$ is not in $L^1$. But instead, we can approximate $f$ by functions in the space $L^1 \cap L^2$, and it turns out that
the limit will make sense and live in $L^2$. And having a unitary mapping means that we also have a simple inverse: for
any function $h \in L^1(\mathbb{R}) \cap L^2(\mathbb{R})$, we can define
$$U^*(h)(x) = \frac{1}{\sqrt{2\pi}} \int e^{-itx} h(t)\,dt = \frac{\hat h(-x)}{\sqrt{2\pi}}$$
(which is like $U(f) = \frac{\hat f}{\sqrt{2\pi}}$ but with an additional negative sign). The logic from Theorem 107 applies here as well,
so this map also extends to a map $U^* : L^2(\mathbb{R}) \to L^2(\mathbb{R})$. This gives us an important result: the Fourier inversion
theorem for functions tells us that
$$f(x) = (U^* U f)(x) = \frac{1}{2\pi}\,\hat{\hat f}(-x),$$
meaning that up to a scaling factor and a change of sign, the Fourier transform is an involution. Unfortunately, the
Fourier transform for general probability measures (instead of functions) is not quite as nice:


Theorem 108 (Fourier inversion formula for probability measures on R) 


Suppose $\mu$ is a probability measure on $\mathbb{R}$ and $\varphi_\mu(t)$ is its characteristic function. Then for any $a < b$ we have
$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \varphi_\mu(t) \left(\frac{e^{-ita} - e^{-itb}}{it}\right) dt = \tilde\mu(a, b) = \mu((a, b)) + \frac{\mu(\{a, b\})}{2}.$$

The point of this result is that we use $\mu$ to define $\varphi_\mu$, but we can also use $\varphi_\mu$ to determine $\mu$ (it's left as an
exercise that even though $\tilde\mu$ and $\mu$ are not exactly the same, we can use $\tilde\mu$ to uniquely determine $\mu$). By the way, we
do need to integrate from $-T$ to $T$ to ensure the integral actually exists, and it's part of the content of the theorem


that those integrals converge as $T \to \infty$:

Example 109

Consider the random variable $X$ which takes on the values $\pm 1$ with probability $\frac{1}{2}$ each. Then
$$\varphi_X(t) = \frac{e^{it} + e^{-it}}{2} = \cos t.$$
This function does not decay as $t \to \infty$, so plugging it into the integral on the left-hand side of Theorem 108
without the truncation to $[-T, T]$ would not work.
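
The truncated integrals in Theorem 108 can still be evaluated numerically for this example. The Python sketch below (function names, the midpoint rule, and the step count are choices for this illustration) approximates $\frac{1}{2\pi}\int_{-T}^{T} \varphi(t)\,\frac{e^{-ita} - e^{-itb}}{it}\,dt$ for $\varphi(t) = \cos t$, where $\mu$ puts mass $\frac12$ on each of $\pm 1$.

```python
import cmath
from math import cos, pi


def phi_pm_one(t: float) -> float:
    """Characteristic function of X = +1 or -1 with probability 1/2 each."""
    return cos(t)


def truncated_inversion(phi, a: float, b: float, T: float, steps: int = 100000) -> float:
    """Midpoint rule for (1/2pi) * integral over [-T, T] of phi(t) (e^{-ita} - e^{-itb}) / (it) dt.

    `steps` should be even so that no midpoint lands exactly on t = 0,
    where the integrand has a removable singularity.
    """
    h = 2 * T / steps
    total = 0j
    for i in range(steps):
        t = -T + (i + 0.5) * h
        total += phi(t) * (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
    return (total * h / (2 * pi)).real
```

For large $T$ the values should approach $\mu((a,b)) + \frac12\mu(\{a,b\})$: about $1$ for $(a, b) = (-2, 2)$, about $\frac12$ for $(-1, 1)$ (both atoms sit on the endpoints), and about $\frac14$ for $(0, 1)$. Convergence is slow, with an error of order $1/T$.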


Before we go into the proof, we'll discuss the intuitive explanation for this result. We're basically being given $\varphi_\mu$,
and we want to recover the distribution by finding, for example, $\mu((a, b))$ for some $a < b$. In other words, we wish to
compute $\mu((a, b)) = \int h(x)\,d\mu(x)$, where $h$ is the indicator function $\mathbf{1}_{(a,b)}$. Because $h$ has bounded support, we can
compute the Fourier transform of this indicator function directly to be
$$\hat h(t) = \int e^{itx} h(x)\,dx = \int_a^b e^{itx}\,dx = \frac{e^{itb} - e^{ita}}{it}.$$
So if we think of $\mu((a, b)) = \int h\,d\mu = \langle \mu, h \rangle$ as some kind of inner product (whatever that means, because $\mu$ is a
measure rather than a function), then Theorem 107 should yield
$$\mu((a, b)) = \frac{1}{2\pi}\langle \hat\mu, \hat h \rangle = \frac{1}{2\pi}\langle \varphi_\mu, \hat h \rangle = \int \varphi_\mu(t)\,\frac{e^{-ita} - e^{-itb}}{2\pi i t}\,dt.$$


In reality, this integral may not even be defined, but we at least see why the integrand on the left-hand side takes
the form that it does. To understand this further, we can approximate it by truncating in the Fourier domain,
integrating the last expression from $-T$ to $T$ only:
$$\int_{-T}^{T} \varphi_\mu(t)\,\frac{e^{-ita} - e^{-itb}}{2\pi i t}\,dt = \int \varphi_\mu(t)\,k_T(t)\,\frac{e^{-ita} - e^{-itb}}{2\pi i t}\,dt,$$
where $k_T$ is the truncation function $\mathbf{1}\{|t| \le T\}$. We can rewrite this as an inner product $\frac{1}{2\pi}\langle \hat\mu, \hat h k_T \rangle = \langle \mu, h_T \rangle$, where
$h_T$ is the function satisfying $\hat h_T = \hat h\,k_T$. We can solve for $h_T$ explicitly, but we don't need to: the idea is that $\hat h$ is
a not-so-smooth function (the indicator of $(a, b)$) mapped into the Fourier domain, so removing the high-frequency
part by multiplying by $k_T$ should mean that $h_T$ is essentially a “smooth version” of $h$. So if that smoothing goes
away as $T \to \infty$, it makes sense that we have
$$\lim_{T \to \infty} h_T(x) = \tilde h(x) = \mathbf{1}\{x \in (a, b)\} + \frac{1}{2} \cdot \mathbf{1}\{x \in \{a, b\}\},$$
because $h$ is 0 on one side of $a$ (and also $b$) and 1 on the other, so the $h_T$s will have half of each contribution. This
completes the intuition for why we get only half of the measure on the endpoints, and we're now almost ready to dive
into the proof. We'll first figure out what $h_T$ should be so that it satisfies $\hat h_T = \hat h\,k_T$:


Proposition 110 


The function $\mathrm{sinc}(t) = \frac{\sin t}{t}$ is not in $L^1(\mathbb{R})$, because it decays like $\frac{1}{t}$. However, the truncated integral of that function

$$S(T) = \int_{-T}^{T} \mathrm{sinc}(t)\,dt$$

converges to $\pi$ as $T \to \infty$.
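Proof. There are various ways to do this, but we'll calculate using complex analysis. We can first rewrite

$$S(T) = \int_{-T}^{T} \frac{e^{it} - e^{-it}}{2it}\,dt = \int_{-T}^{T} \frac{e^{it}}{it}\,dt$$

by symmetry (with the last integral understood as a principal value at $t = 0$). Now the function $h(z) = \frac{e^{iz}}{iz}$ is holomorphic on $\mathbb{C} \setminus \{0\}$, so if we integrate around any closed contour $\gamma$ not containing or passing through the origin, then $\oint_\gamma h(z)\,dz = 0$. We will integrate over the following indented semicircle once counterclockwise and take $R \to \infty$, $\varepsilon \to 0$:

[Figure: indented semicircular contour in the upper half-plane, made up of the small semicircle I of radius $\varepsilon$ around the origin, the large semicircle III of radius $R$, and the two real-axis segments II and IV joining them.]

The integral of $\mathrm{sinc}(t)$ from $-\varepsilon$ to $\varepsilon$ vanishes as $\varepsilon \to 0$ (because $\frac{\sin t}{t} \to 1$ as $t \to 0$), so the integral we want to compute is the contribution along parts II and IV of the contour. It suffices to show that (1) as $\varepsilon \to 0$, the integral along I goes to $-\pi$, and (2) as $R \to \infty$, the integral along III goes to 0. (2) can be done with some careful bounding which we'll skip (by parametrizing the semicircle and then bounding the resulting integral $\int_0^\pi e^{-R \sin\theta}\,d\theta$ by breaking it up into a region with small length and a region with small integrand). Meanwhile, (1) can be done by parametrizing $z$ as $\varepsilon e^{i\theta}$ for $\theta$ from $\pi$ to 0, so that the integral along region I is

$$\int_{\mathrm{I}} h(z)\,dz = \int_\pi^0 \frac{e^{i\varepsilon e^{i\theta}}}{i\varepsilon e^{i\theta}}\, i\varepsilon e^{i\theta}\,d\theta = \int_\pi^0 e^{i\varepsilon e^{i\theta}}\,d\theta.$$

But the integrand converges to 1 uniformly in $\theta$, so this integral converges to $\int_\pi^0 1\,d\theta = -\pi$ as $\varepsilon \to 0$. So because the total contour integral is 0, the integral we're trying to find (along II and IV) is $\pi$, as desired.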
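As a numerical sanity check on Proposition 110 (not part of the lecture), the sketch below approximates $S(T)$ with a trapezoid rule and also integrates $|\mathrm{sinc}|$, which keeps growing roughly like $\log T$. The helper names `_trapezoid_even`, `S`, and `abs_integral` and the step size `dt` are illustrative choices.

```python
import numpy as np

def _trapezoid_even(f, T, dt):
    """Trapezoid rule for the integral of an even function f over [-T, T]."""
    n = max(1, int(round(T / dt)))
    t = np.linspace(0.0, T, n + 1)
    y = f(t)
    return 2 * (T / n) * (y.sum() - 0.5 * (y[0] + y[-1]))

def S(T, dt=1e-3):
    # np.sinc(x) = sin(pi x) / (pi x), so np.sinc(t / pi) = sin(t) / t (and 1 at t = 0)
    return _trapezoid_even(lambda t: np.sinc(t / np.pi), T, dt)

def abs_integral(T, dt=1e-3):
    # integral of |sin(t) / t| over [-T, T]; this one does not converge
    return _trapezoid_even(lambda t: np.abs(np.sinc(t / np.pi)), T, dt)

for T in (10.0, 100.0, 1000.0):
    print(T, S(T), abs_integral(T))
```

The first printed column of values should settle near $\pi$, while the second keeps increasing with $T$.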


We will use this function $S(T)$ to find $h_T$ by applying the Fourier inversion formula for functions. Inverting $\hat{h}_T = \hat{h} k_T$, we have

$$h_T(x) = \frac{1}{2\pi} \int e^{-itx}\,\hat{h}_T(t)\,dt = \frac{1}{2\pi} \int_{-T}^{T} e^{-itx} \left( \frac{e^{itb} - e^{ita}}{it} \right) dt.$$

If we average between the values at $t$ and $-t$ for each point in the integrand, the integral can be rewritten as

$$h_T(x) = \int_{-T}^{T} \left( \frac{\sin(t(b - x))}{2\pi t} - \frac{\sin(t(a - x))}{2\pi t} \right) dt,$$

where we use that $\sin\theta = \frac{e^{i\theta} - e^{-i\theta}}{2i}$. The two terms can now be integrated separately: the integral of $\frac{\sin(t(b - x))}{2\pi t}$ is zero if $b - x = 0$, and in all other cases we can make a $u$-substitution (flipping the limits of integration if $b - x$, $a - x$ are negative) to write this in terms of the function $S$ from above. We thus have

$$h_T(x) = \frac{\mathrm{sgn}(b - x)\,S(|b - x| \cdot T) - \mathrm{sgn}(a - x)\,S(|a - x| \cdot T)}{2\pi},$$

where we define $\mathrm{sgn}(0) = 0$. Now if $b - x$, $a - x$ are both positive or both negative, both terms converge to the same value (because $S$ converges to $\pi$ as its argument goes to infinity), so the limit is only nonzero when $x \in [a, b]$. If $x$ is strictly between $a$ and $b$, the limit is $\frac{\pi + \pi}{2\pi} = 1$, and otherwise (when $x = a$ or $x = b$) it is $\frac{\pi}{2\pi} = \frac{1}{2}$. Putting this all together, we indeed find that $h_T$ converges pointwise to $\tilde{h}(x) = 1_{(a,b)} + \frac{1}{2} \cdot 1_{\{a,b\}}$, as desired.
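The closed form for $h_T$ can be evaluated numerically to see the three limiting values 1, $\frac{1}{2}$, and 0. This is only a sketch: `S` is a trapezoid-rule approximation with an assumed step size, and the interval $(a, b) = (0, 1)$ and cutoff $T = 200$ are arbitrary illustrative choices.

```python
import numpy as np

def S(T, dt=1e-3):
    """Approximate the integral of sin(t)/t over [-T, T] with the trapezoid rule."""
    n = max(1, int(round(T / dt)))
    t = np.linspace(0.0, T, n + 1)
    y = np.sinc(t / np.pi)          # equals sin(t)/t, with value 1 at t = 0
    return 2 * (T / n) * (y.sum() - 0.5 * (y[0] + y[-1]))

def h_T(x, a, b, T):
    """h_T(x) = [sgn(b-x) S(|b-x| T) - sgn(a-x) S(|a-x| T)] / (2 pi), with sgn(0) = 0."""
    return (np.sign(b - x) * S(abs(b - x) * T)
            - np.sign(a - x) * S(abs(a - x) * T)) / (2 * np.pi)

a, b, T = 0.0, 1.0, 200.0
for x in (0.5, 0.0, 1.0, 2.0):      # interior point, both endpoints, exterior point
    print(x, h_T(x, a, b, T))
```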
Remark 111. Another way we can see that $h_T$ is smooth is that Fourier transforms take convolution to multiplication, so $\hat{h}_T = \hat{h} k_T \implies h_T = h * S_T$, where $\hat{S}_T = k_T$. This function $S_T$ is "wavy" and becomes more oscillatory as $T$ gets larger, but it is smooth, and thus the convolution $h_T$ is also smooth.

With this, we're finally ready to prove the Fourier inversion theorem:


Proof of Theorem 108. We start by manipulating the left-hand side in terms of the functions we've been describing:

$$I_T = \frac{1}{2\pi} \int_{-T}^{T} \phi_\mu(t) \left( \frac{e^{-ita} - e^{-itb}}{it} \right) dt = \frac{1}{2\pi} \int_{-T}^{T} \phi_\mu(t)\,\overline{\hat{h}(t)}\,dt = \frac{1}{2\pi} \int_{-T}^{T} \left( \int e^{itx}\,d\mu(x) \right) \overline{\hat{h}(t)}\,dt.$$

Using that $e^{itx} = \overline{e^{-itx}}$ and swapping the order of integration (which is allowed by Fubini's theorem because the integrand is bounded, with $|e^{itx}\hat{h}(t)| \le b - a$, and we integrate against a finite measure on $[-T, T] \times \mathbb{R}$), we have

$$I_T = \int \overline{\left( \frac{1}{2\pi} \int_{-T}^{T} e^{-itx}\,\hat{h}(t)\,dt \right)}\,d\mu(x).$$

Applying the Fourier inversion formula for functions to the inner integral, we can simplify to

$$I_T = \int \overline{h_T(x)}\,d\mu(x) = \int h_T(x)\,d\mu(x)$$

(where the last step follows from $h_T$ being real by our previous computation). All of this holds for a fixed $T$, and to finish, we need to show that $I_T = \int h_T(x)\,d\mu(x)$ converges to $\int \tilde{h}(x)\,d\mu(x)$ as $T \to \infty$ (where $\tilde{h} = 1\{(a, b)\} + \frac{1}{2} \cdot 1\{\{a, b\}\}$ as defined above). We already know that $h_T \to \tilde{h}$ pointwise from earlier work, and we know that $h_T$ is of the form $\frac{1}{2\pi}(\pm S(\text{something}) \pm S(\text{something else}))$. But since $S$ is continuous, vanishes at 0, and converges to a finite limit at infinity, it is uniformly bounded by some constant. Thus, $h_T$ is bounded uniformly as well ($\sup_T \|h_T\|_\infty$ is finite), so the bounded convergence theorem applies. This means that the left-hand side $I_T$ does indeed converge to $\int \tilde{h}(x)\,d\mu(x) = \mu((a, b)) + \frac{1}{2}\mu(\{a, b\})$, as desired.
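To see the half-weight at atoms concretely, the sketch below evaluates the truncated integral of Theorem 108 for the $\pm 1$ coin flip of Example 109, where $\phi(t) = \cos t$. It assumes a real, even characteristic function (so the integral reduces to twice a real integral over $[0, T]$); the function name `truncated_inversion`, the cutoff $T = 2000$, and the step size are illustrative choices.

```python
import numpy as np

def truncated_inversion(phi, a, b, T, dt=1e-3):
    """(1/2pi) * integral over [-T, T] of phi(t) (e^{-ita} - e^{-itb}) / (it) dt.

    Assumes phi is real and even, so only the real part of the kernel,
    (sin(tb) - sin(ta)) / t, contributes and we can integrate over [0, T].
    """
    n = max(1, int(round(T / dt)))
    t = np.linspace(0.0, T, n + 1)
    # b * np.sinc(t b / pi) = sin(t b) / t, with the correct limit at t = 0
    kernel = b * np.sinc(t * b / np.pi) - a * np.sinc(t * a / np.pi)
    y = phi(t) * kernel
    integral = (T / n) * (y.sum() - 0.5 * (y[0] + y[-1]))
    return integral / np.pi

T = 2000.0
print(truncated_inversion(np.cos, -1.0, 0.5, T))   # endpoint a = -1 is an atom
print(truncated_inversion(np.cos, -2.0, 0.0, T))   # atom at -1 lies inside (a, b)
print(truncated_inversion(np.cos, -0.5, 0.5, T))   # no mass in [a, b]
```

The theorem predicts limits of $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$, $\frac{1}{2}$, and 0 for these three intervals as $T \to \infty$.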


Example 112 


For a concrete example of a Fourier transform, we'll compute the characteristic function of a Gaussian with density

$$g(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}.$$

Since we have an explicit density function, we can write

$$\phi(t) = \hat{g}(t) = \int e^{itx} g(x)\,dx = \frac{1}{\sqrt{2\pi}} \int e^{itx} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int \exp\left\{ -\frac{(x - it)^2}{2} \right\} e^{-t^2/2}\,dx,$$

which we can think of as an integral along a line in the complex plane as

$$\phi(t) = e^{-t^2/2} \int_{x \in \mathbb{R}} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(x - it)^2}{2} \right\} dx = e^{-t^2/2} \int_{\mathbb{R} - it} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz.$$

We'll now again appeal to complex analysis by integrating this normal density $\frac{e^{-z^2/2}}{\sqrt{2\pi}}$ once counterclockwise around a rectangle (as shown below) as $R \to \infty$:

[Figure: rectangular contour in the complex plane with horizontal sides along the real line and along the shifted line $\mathbb{R} - it$, between real parts $-R$ and $R$, and vertical sides II and IV.]

Since $\frac{e^{-z^2/2}}{\sqrt{2\pi}}$ is holomorphic everywhere, the integral around this closed loop is 0, and we can check that the integrals along II and IV go to zero as $R \to \infty$. Thus, the integrals along the real line $\mathbb{R}$ and the shifted real line $\mathbb{R} - it$ are the same, and they're both equal to 1 because we're integrating a probability density. Plugging this in, we find that $\phi(t) = e^{-t^2/2}$. This is an important example because (ignoring factors of $2\pi$) the Fourier transform of a Gaussian is a Gaussian. In other words, the Gaussian is a sort of eigenvector for the Fourier transform.
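The identity $\phi(t) = e^{-t^2/2}$ is easy to confirm by direct numerical integration. In this sketch the function name `gaussian_cf`, the truncation window $[-12, 12]$, and the step size are assumptions made for illustration.

```python
import numpy as np

def gaussian_cf(t, half_width=12.0, dx=1e-3):
    """Trapezoid approximation of the integral of e^{itx} g(x) dx for the standard normal density g."""
    n = int(round(2 * half_width / dx))
    x = np.linspace(-half_width, half_width, n + 1)
    y = np.exp(1j * t * x) * np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    return (2 * half_width / n) * (y.sum() - 0.5 * (y[0] + y[-1]))

for t in (0.0, 1.0, 2.5):
    print(t, gaussian_cf(t), np.exp(-t**2 / 2))
```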


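To finish today's lecture, we'll do a weak form of the $L^2$ isometry (basically proving a simple case of Theorem 107):

Proposition 113

Suppose $f, h \in L^1 \cap L^2 \cap L^\infty$ are continuous functions. Then $(\hat{f}, \hat{h}) = 2\pi (f, h)$.

Proof. We will assume the part of Theorem 107 that if $f \in L^1 \cap L^2$, then $\hat{f} \in L^2 \cap L^\infty$ (this comes from Fourier analysis). That means that $\hat{f}$ and $\hat{h}$ are in $L^2$, so by Cauchy-Schwarz $\hat{f}\,\overline{\hat{h}}$ is integrable. By the dominated convergence theorem, we can therefore write

$$(\hat{f}, \hat{h}) = \lim_{\varepsilon \downarrow 0} \int \hat{f}(t)\,\overline{\hat{h}(t)} \exp\left( -\frac{\varepsilon t^2}{2} \right) dt.$$

Expanding out the definitions of the Fourier transforms, we have

$$(\hat{f}, \hat{h}) = \lim_{\varepsilon \downarrow 0} \int \left( \int e^{itx} f(x)\,dx \right) \left( \int e^{-ity}\,\overline{h(y)}\,dy \right) \exp\left( -\frac{\varepsilon t^2}{2} \right) dt.$$

By Fubini's theorem, we can now swap the order of integration to get

$$(\hat{f}, \hat{h}) = \lim_{\varepsilon \downarrow 0} \iint f(x)\,\overline{h(y)} \left( \int e^{it(x - y)} \exp\left( -\frac{\varepsilon t^2}{2} \right) dt \right) dx\,dy$$
$$= \lim_{\varepsilon \downarrow 0} \iint f(x)\,\overline{h(y)}\,\sqrt{\frac{2\pi}{\varepsilon}} \left( \int e^{it(x - y)} \sqrt{\frac{\varepsilon}{2\pi}} \exp\left( -\frac{\varepsilon t^2}{2} \right) dt \right) dx\,dy.$$

Since $\sqrt{\frac{\varepsilon}{2\pi}} \exp\left( -\frac{\varepsilon t^2}{2} \right)$ is the density of $\frac{Z}{\sqrt{\varepsilon}}$ where $Z$ is standard Gaussian, the inner integral (in parentheses) is $E[e^{i(x - y) Z / \sqrt{\varepsilon}}] = e^{-(x - y)^2 / (2\varepsilon)}$. This leaves us with

$$(\hat{f}, \hat{h}) = \lim_{\varepsilon \downarrow 0} \iint f(x)\,\overline{h(y)}\,\sqrt{\frac{2\pi}{\varepsilon}}\, e^{-(x - y)^2 / (2\varepsilon)}\,dx\,dy.$$

(So we start off with a "spread out Gaussian" in the Fourier space, which has now become a very "peaked" Gaussian in regular space.) Taking the limit $\varepsilon \to 0$, the factor $e^{-(x - y)^2 / (2\varepsilon)}$ will shift all of the weight of the integral to the region around $x = y$. Thus by continuity of $f$ and $h$, we have

$$(\hat{f}, \hat{h}) = 2\pi \int f(x)\,\overline{h(x)}\,dx = 2\pi (f, h),$$

as desired.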
14 October 28, 2019

(Our midterm exams will be brought to class on Wednesday.) Last time, we discussed the Fourier inversion theorem for probability measures on $\mathbb{R}$, which lets us recover the measure $\mu$ for a random variable $X$ given its characteristic function. (One key example was that the characteristic function for the standard Gaussian measure is $\phi(t) = \exp\left( -\frac{t^2}{2} \right)$, which will come up again.) The main goal of today will be to prove the central limit theorem for iid sequences. Basically, we'll show that if $X, X_i$ are iid with $E[X] = 0$ and $E[X^2] = 1$, then $\frac{1}{\sqrt{n}} \sum_{i=1}^n X_i \Rightarrow N(0, 1)$ for some sense of convergence ("in distribution") which we'll now formally define.


Definition 114 


Let $S$ be a metric space with Borel sigma-algebra $B$ generated by the open sets in the metric topology, and let $\mu_n, \mu$ be probability measures on $(S, B)$. We say that $\mu_n$ converges weakly to $\mu$ (written as $\mu_n \Rightarrow \mu$) if $\int f\,d\mu_n \to \int f\,d\mu$ for all bounded continuous functions $f : S \to \mathbb{R}$. If $X_n, X$ are random variables, then we say that $X_n \xrightarrow{d} X$ converges in distribution if $L_{X_n} \Rightarrow L_X$.

To understand why we require $f$ to be continuous, consider the (deterministic) random variables $X_n = \frac{1}{n}$. Then $X_n$ converge to $X = 0$ almost surely (so it also makes sense to have them converge in distribution), but the function $f(x) = 1\{x > 0\}$ does not satisfy $f(X_n) \to f(X)$.
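The point about continuity can be made concrete in a few lines. Since $X_n = \frac{1}{n}$ is deterministic, $\int f\,d\mu_n$ is just $f(1/n)$; the test functions below (`f_indicator` and the arbitrary bounded continuous choice `f_continuous`) are illustrative names, not notation from the lecture.

```python
import math

def f_indicator(x):
    """Bounded but discontinuous at 0: the indicator of {x > 0}."""
    return 1.0 if x > 0 else 0.0

def f_continuous(x):
    """A bounded continuous test function (arctan is one arbitrary choice)."""
    return math.atan(x)

xs = [1.0 / n for n in (1, 10, 100, 1000, 10**6)]     # the values of X_n = 1/n
indicator_values = [f_indicator(x) for x in xs]        # stuck at 1, but f_indicator(0) = 0
continuous_values = [f_continuous(x) for x in xs]      # approaches f_continuous(0) = 0
print(indicator_values, f_indicator(0.0))
print(continuous_values, f_continuous(0.0))
```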


Fact 115 
Every probability measure $\mu$ on a metric space $(S, B)$ is regular, which means that for any $A \in B$, we can approximate $\mu(A)$ from above and below:

$$\mu(A) = \sup\{\mu(F) : F \subseteq A,\ F \text{ closed}\} = \inf\{\mu(G) : G \supseteq A,\ G \text{ open}\}.$$

The proof in general is similar to the one we did on our homework for $(\mathbb{R}, B_{\mathbb{R}})$. In particular, this implies that the measure $\mu$ is determined by $\{\mu(F) : F \text{ closed}\}$.


We can thus make the following claim: 


Lemma 116 


Suppose we have two probability measures $\mu$ and $\nu$ such that $\int f\,d\mu = \int f\,d\nu$ for all bounded continuous functions $f : S \to \mathbb{R}$. Then $\mu = \nu$.

Proof. For any closed set $F \subseteq S$, we can approximate its indicator function with a continuous function, namely

$$f(x) = \max\left( 0,\ 1 - \frac{\mathrm{dist}(x, F)}{\varepsilon} \right).$$

In other words, $f(x) = 1$ if $x \in F$, $f(x) = 0$ if $x$ is more than $\varepsilon$ away from $F$, and we linearly interpolate in between. If we let $F^\varepsilon$ denote the $\varepsilon$-neighborhood of $F$, we then have that for all $\varepsilon > 0$,

$$1_F \le f \le 1_{F^\varepsilon} \implies \mu(F) \le \int f\,d\mu = \int f\,d\nu \le \nu(F^\varepsilon),$$

where the middle equality holds by assumption because $f$ is a bounded continuous function. Since $F$ is a closed set, $F^\varepsilon \downarrow F$ as $\varepsilon \downarrow 0$, so $\nu(F^\varepsilon) \downarrow \nu(F)$ by continuity from above, and therefore $\nu(F) \ge \mu(F)$. Repeating the argument in the other direction, we see that $\mu(F) = \nu(F)$. Since this holds for all closed sets, we find that $\mu = \nu$ by Fact 115, as desired.
Corollary 117

A sequence of measures $\mu_n$ cannot converge weakly to two different limits.

With that fact, we can now think about how to study the space of probability measures. Let $\mu$ be any probability measure on $S$, and for any $\varepsilon > 0$ and (finitely many) bounded continuous functions $f_1, \ldots, f_k$, define the set

$$U_{\varepsilon, f_1, \ldots, f_k}(\mu) = \left\{ \nu \text{ probability measures on } S : \left| \int f_i\,d\nu - \int f_i\,d\mu \right| < \varepsilon \text{ for all } 1 \le i \le k \right\}$$

to be a kind of neighborhood around the measure $\mu$. We can check the following fact from the definition:


Proposition 118 
Let $P$ be the space of probability measures on $(S, B)$, and let $\mathcal{T}$ be the topology on $P$ generated by the neighborhoods $U_{\varepsilon, f_1, \ldots, f_k}(\mu)$. Then weak convergence (as in Definition 114) is equivalent to convergence in the topology $\mathcal{T}$.


Because we have neighborhoods around measures, it’s natural to also define a distance between measures: 


Definition 119 


The Prohorov metric is defined via

$$\pi(\mu, \nu) = \inf\{\varepsilon > 0 : \mu(A) \le \nu(A^\varepsilon) + \varepsilon \text{ and } \nu(A) \le \mu(A^\varepsilon) + \varepsilon \ \ \forall A \in B\}$$

for any $\mu, \nu \in P$, where $A^\varepsilon$ denotes the $\varepsilon$-neighborhood of the set $A$ in the metric space $S$.

Fact 120
If $S$ is a Polish space, meaning that it is complete (Cauchy sequences converge) and separable (it has a countable dense subset), then weak convergence of measures is equivalent to convergence in the $\pi$-metric.


In particular, if S is complete and separable, then P is also complete and separable and itself satisfies the conditions 
above. We often want to pick a random probability measure from the set of probability measures, so having the Polish 
space assumption is nice — this is why almost all of probability happens on Polish spaces, and we'll always be in this 
setting for this class. 

With this, we can turn to the central limit theorem. In general, the main strategy for showing convergence in 


distribution is to break the proof into two parts: 


* First, show that the family of measures $\mu_n$ is confined in a compact subset of $P$, which implies that there are subsequences of $\mu_n$ that converge. (This can usually be done with rough estimates.)


* Once we know that subsequences converge, leverage that fact to show that all subsequential limits coincide. 


This might be a bit abstract, so it’s important for us to understand what compact spaces in P actually look like: 


Definition 121 
Let $\{\mu_\alpha : \alpha \in I\} \subseteq P$ be a set of probability measures. We say that $\{\mu_\alpha\}$ is tight if for all $\varepsilon > 0$, there exists a compact set $K_\varepsilon \subseteq S$ such that $\mu_\alpha(K_\varepsilon) \ge 1 - \varepsilon$ for all $\alpha \in I$.

Tightness of a set of measures basically means that the mass is mostly concentrated within a compact set. Two example families of probability measures that are not tight are $\{\nu_n = N(0, n)\}$ and $\{\eta_n = N(n, 1)\}$ (because significant mass can be arbitrarily far away from the origin). It turns out that tightness and compactness are essentially equivalent in the following way:


Theorem 122 (Prohorov) 


Let $\mu_\alpha$ be probability measures on a metric space $(S, B)$. If $\{\mu_\alpha\}$ is tight, then the family of measures is relatively compact in $P$ (that is, it has compact closure). Furthermore, if $S$ is a Polish space, then the converse also holds (though this is less useful).

It turns out the case $S = \mathbb{R}$ is easier to prove, and this leads to the Helly selection theorem. We should make sure we understand the statement of Prohorov's theorem, and we're responsible for reading and understanding the proof of the Helly selection theorem. (This can be found in the course textbook, but the main idea is to take the family of cumulative distribution functions $F_n$, repeatedly take subsequences of this family that converge at each rational to get limits at each $q \in \mathbb{Q}$, and then define a function $F$ on $\mathbb{R}$ by making $F(x)$ the infimum of those limits across all $q > x$. This function will be increasing and right-continuous but not necessarily a distribution function because its limits at $-\infty$ and $+\infty$ may not be 0 and 1, but they will be if and only if the measures are tight.)

Turning now to the central limit theorem, the key to the proof will be to relate weak convergence to characteristic 


functions: 


Lemma 123 


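Suppose $\mu$ is a probability measure on $\mathbb{R}$ with characteristic function $\phi$. Then for any $u > 0$, we have

$$\mu\left( \left\{ x \in \mathbb{R} : |x| \ge \frac{2}{u} \right\} \right) \le \frac{1}{u} \int_{-u}^{u} (1 - \phi(t))\,dt.$$

In other words, the tail behavior of the measure $\mu$ is related to the characteristic function's values near 0.

Proof. Writing out the definition of the characteristic function and using Fubini's theorem to change the order of integration (the integrand is bounded), we have

$$\frac{1}{u} \int_{-u}^{u} (1 - \phi(t))\,dt = \frac{1}{u} \int_{-u}^{u} \int (1 - e^{itx})\,d\mu(x)\,dt = \int \frac{1}{u} \int_{-u}^{u} (1 - e^{itx})\,dt\,d\mu(x).$$

The inner integral can be directly computed, and we end up with

$$= \int \frac{1}{u} \left( 2u - \frac{2\sin(ux)}{x} \right) d\mu(x) = \int 2 \left( 1 - \frac{\sin(ux)}{ux} \right) d\mu(x).$$

The integrand is always nonnegative, so restricting the domain to the region where $|ux| \ge 2$ only makes the integral smaller. Additionally, because $|\sin(ux)| \le 1$, the integrand here is at least $2\left(1 - \frac{1}{2}\right) = 1$ whenever $|ux| \ge 2$. Thus we indeed have

$$\frac{1}{u} \int_{-u}^{u} (1 - \phi(t))\,dt \ge \int_{|ux| \ge 2} 1\,d\mu(x) = \mu(\{x : |ux| \ge 2\}),$$

as desired.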
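For the standard Gaussian, both sides of Lemma 123 have closed forms, so the inequality can be checked directly. This is a sketch for one specific measure only; the function names `tail_mass` and `cf_bound` are illustrative.

```python
import math

def tail_mass(u):
    """mu({|x| >= 2/u}) for the standard Gaussian, via the complementary error function."""
    return math.erfc(2 / (u * math.sqrt(2)))

def cf_bound(u):
    """(1/u) * integral over [-u, u] of (1 - exp(-t^2/2)) dt, computed in closed form."""
    return 2 - math.sqrt(2 * math.pi) * math.erf(u / math.sqrt(2)) / u

for u in (0.1, 0.5, 1.0, 2.0, 4.0, 10.0):
    print(u, tail_mass(u), cf_bound(u))
```

Both columns shrink as $u \downarrow 0$, which is the mechanism behind the tightness argument that uses this lemma.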


From this lemma, we can get the following important result: 
Theorem 124 (Continuity theorem)

Let $\mu_n$ be probability measures on $\mathbb{R}$ with characteristic functions $\phi_n$. Then

* If $\mu_n$ converge weakly to some $\mu$, then $\phi_n$ converge pointwise to the characteristic function $\phi$ of $\mu$.

* Conversely, if $\phi_n$ converge pointwise to $\phi$, and $\phi$ is continuous at $t = 0$, then $\phi$ is a valid characteristic function of some probability measure $\mu$ with $\mu_n \Rightarrow \mu$.

Proof. For the forward direction, if $\mu_n \Rightarrow \mu$, then $\phi_n(t) = \int_{-\infty}^{\infty} e^{itx}\,d\mu_n(x)$ is the integral of the bounded continuous function $f(x) = e^{itx}$, so $\phi_n(t) \to \phi(t)$ pointwise by the definition of weak convergence (applied to the real and imaginary parts).

For the reverse direction, start with the inequality $\mu_n\left(\left\{|x| \ge \frac{2}{u}\right\}\right) \le \frac{1}{u} \int_{-u}^{u} (1 - \phi_n(t))\,dt$ from Lemma 123. Fix $u$ and take $n \to \infty$; by the bounded convergence theorem (and the assumption that $\phi_n \to \phi$ pointwise), the right-hand side converges to $\frac{1}{u} \int_{-u}^{u} (1 - \phi(t))\,dt$.

Additionally, since $\phi$ is continuous at $t = 0$ and $\phi(0) = 1$, this integral converges to 0 as $u \to 0$. This shows that the $\mu_n$ are tight: indeed, $f(u) = \limsup_{n \to \infty} \mu_n\left(|x| \ge \frac{2}{u}\right)$ goes to 0 as $u \downarrow 0$, so for any $\varepsilon > 0$ we can take an appropriately small $u$ such that $f(u) < \varepsilon$. That implies that $\mu_n\left(\left\{|x| \ge \frac{2}{u}\right\}\right) < \varepsilon$ for all but finitely many $n$, and then we can take the union of the compact set $\left[-\frac{2}{u}, \frac{2}{u}\right]$ with the finitely many compact sets that each contain at least $1 - \varepsilon$ of the mass of one of the measures at the beginning of our sequence.

Thus, $\mu_n$ weakly converges along subsequences by Prohorov's theorem, so for any subsequence where we have weak convergence $\mu_{n_k} \Rightarrow \nu$, we must also have $\phi_{n_k} \to \phi_\nu$ (by the forward direction argument). But we have $\phi_{n_k} \to \phi$ for any subsequence (because the whole sequence $\phi_n$ converges to $\phi$ by assumption), so because $\phi$ is the limit of such a weak convergence, it is a valid characteristic function. This argument applies for any subsequence, so all subsequential limits have characteristic function $\phi$ and are therefore identical (by Theorem 108, a measure is determined by its characteristic function); call this common limit $\mu$. Therefore every subsequence of $\mu_n$ has a further subsequence converging to $\mu$, and (as a general property of topological spaces) this implies that $\mu_n$ converges to $\mu$ as desired.


Example 125 (A non-example) 


Consider the wide Gaussians $\mu_n = N(0, n)$ from before (which are not tight). Then $\phi_n(t) = \exp\left( -\frac{n t^2}{2} \right)$ converges to the indicator function $1\{t = 0\}$, which is not continuous. Indeed, the measures $\mu_n$ do not converge weakly to any probability measure $\mu$.
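Both halves of this non-example can be tabulated: the mass that $N(0, n)$ puts outside a fixed window $[-K, K]$ tends to 1, and $\phi_n(t)$ collapses to 0 for every $t \ne 0$. The function names and the window $K = 10$ below are illustrative choices.

```python
import math

def mass_outside(K, n):
    """P(|X| > K) for X ~ N(0, n), using the complementary error function."""
    return math.erfc(K / math.sqrt(2 * n))

def phi_n(t, n):
    """Characteristic function exp(-n t^2 / 2) of N(0, n)."""
    return math.exp(-n * t * t / 2)

for n in (1, 100, 10**4, 10**6):
    print(n, mass_outside(10.0, n), phi_n(0.0, n), phi_n(0.1, n))
```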


Remark 126 (Joke). In response to "Does this imply CLT?", the answer was "I don't know, does it?"


Thus, the continuity theorem tells us that to prove the central limit theorem, all we need to show is that the characteristic functions of the left side converge pointwise to the characteristic function of the right side. We can check the following fact with a calculus bash (which can also be found in the textbook):


Lemma 127 


For all $x \in \mathbb{R}$, $\left| e^{ix} - \left( 1 + ix - \frac{x^2}{2} \right) \right| \le \min\left( x^2, \frac{|x|^3}{6} \right)$.
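A quick grid check of Lemma 127 (a numerical illustration rather than a proof; the grid range and the variable names are arbitrary):

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 60001)
lhs = np.abs(np.exp(1j * x) - (1 + 1j * x - x**2 / 2))   # Taylor remainder of e^{ix}
rhs = np.minimum(x**2, np.abs(x)**3 / 6)                 # the bound from the lemma
worst_gap = np.max(lhs - rhs)                            # should be <= 0 up to rounding
print(worst_gap)
```

The cubic bound is the smaller one for $|x| < 6$ and the quadratic bound takes over for larger $|x|$, which is why the minimum of the two is used.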


We're now ready to prove two versions of the central limit theorem: 
Theorem 128 (Central limit theorem for iid sequences)

Let $X, X_i$ be iid random variables with $E[X] = 0$ and $E[X^2] = 1$. (These values can always be rescaled, but we do require a finite second moment, which is stronger than the assumption for the strong law of large numbers.) Then

$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \xrightarrow{d} N(0, 1).$$

Proof. By linearity of expectation, we know that

$$E\left[ e^{itX} - \left( 1 + itX - \frac{t^2 X^2}{2} \right) \right] = \phi_X(t) - \left( 1 - \frac{t^2}{2} \right).$$

But plugging in $tX$ into both sides of Lemma 127, taking expectations (which is allowed because both sides are nonnegative random variables), and then dividing by $t^2$ yields

$$\frac{1}{t^2} \left| \phi_X(t) - \left( 1 - \frac{t^2}{2} \right) \right| \le E\left[ \min\left( X^2, \frac{|t|}{6} |X|^3 \right) \right].$$

The expression $\min\left( X^2, \frac{|t|}{6} |X|^3 \right)$ converges to zero pointwise almost surely as $t \to 0$ (because $\frac{|t|}{6} |X|^3$ goes to zero), and it is dominated by the integrable $X^2$. Thus, the right-hand side goes to zero as $t \to 0$ by the dominated convergence theorem, which implies that

$$\phi_X(t) = 1 - \frac{t^2}{2} + o(t^2).$$

From here, the random variable $Z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ has characteristic function

$$\phi_{Z_n}(t) = \phi_X\left( \frac{t}{\sqrt{n}} \right)^n = \left( 1 - \frac{t^2}{2n} + o\left( \frac{t^2}{n} \right) \right)^n.$$

For any fixed $t$, this converges to $\exp\left( -\frac{t^2}{2} \right)$ as $n \to \infty$ (we can check this by taking a log first and using L'Hôpital's rule), which is the characteristic function of the standard Gaussian. Thus $\phi_{Z_n} \to \phi_{N(0,1)}$ pointwise, so Theorem 124 yields the desired result.
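The key limit $\phi_X(t/\sqrt{n})^n \to e^{-t^2/2}$ can be watched numerically for distributions whose characteristic function is known in closed form. The two examples below (a $\pm 1$ coin flip and a uniform variable on $[-\sqrt{3}, \sqrt{3}]$, both with mean 0 and variance 1) and all function names are illustrative choices, not taken from the lecture.

```python
import math

def cf_rademacher(s):
    """Characteristic function of X = +/-1 with probability 1/2 each."""
    return math.cos(s)

def cf_uniform(s):
    """Characteristic function of X uniform on [-sqrt(3), sqrt(3)] (variance 1)."""
    a = math.sqrt(3) * s
    return 1.0 if a == 0 else math.sin(a) / a

def cf_normalized_sum(cf, t, n):
    """Characteristic function of (X_1 + ... + X_n) / sqrt(n) for iid X_i."""
    return cf(t / math.sqrt(n)) ** n

t = 1.5
for n in (10, 1000, 10**6):
    print(n, cf_normalized_sum(cf_rademacher, t, n),
          cf_normalized_sum(cf_uniform, t, n), math.exp(-t * t / 2))
```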


Remark 129. If we had tried to produce a similar proof with the moment generating function instead of the characteristic function, we would run into more problems. In particular, it's possible to have finite second moment and not have the moment generating function be defined anywhere but $t = 0$, and we also don't have such a nice inversion theorem in that case.


Finally, we'll prove a more general central limit theorem: 


Theorem 130 (Lindeberg-Feller central limit theorem) 
Suppose that we have a triangular array of random variables such that for each $n$, $\{X_{n,j} : 1 \le j \le n\}$ are mutually independent with mean 0 but not necessarily iid. Also suppose that the variables are normalized such that $\mathrm{Var}(X_{n,1} + \cdots + X_{n,n}) = \sum_{j=1}^{n} E[X_{n,j}^2]$ converges to some $\sigma^2 > 0$ as $n \to \infty$, and (here is the key assumption) for all $\varepsilon > 0$,

$$\lim_{n \to \infty} \sum_{j=1}^{n} E\left[ X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon \right] = 0.$$

Then $S_n = \sum_{j=1}^{n} X_{n,j}$ converges in distribution to $N(0, \sigma^2)$.
Theorem 128 is a special case of this: if $X_i$ are our iid random variables, then we can plug in $X_{n,j} = \frac{X_j}{\sqrt{n}}$ and check that the conditions hold. The reason we don't have any $\sqrt{n}$ factor here is that we are assuming we have already
done the normalization for the variance to work out, and the key condition says intuitively that each individual variable 


cannot contribute significantly to the variance. 


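Proof. We'll show again that the characteristic function of $S_n$ converges to that of a Gaussian. Let $\phi_{n,j}$ be the characteristic function for $X_{n,j}$ and let $\sigma_{n,j}^2 = E[X_{n,j}^2]$. By the same inequality as in the previous proof (but this time with variance $\sigma_{n,j}^2$), we have

$$\left| \phi_{n,j}(t) - \left( 1 - \frac{t^2 \sigma_{n,j}^2}{2} \right) \right| \le E\left[ \min\left( t^2 X_{n,j}^2,\ \frac{|t|^3 |X_{n,j}|^3}{6} \right) \right].$$

The second term is smaller when $X_{n,j}$ is small, and otherwise the first term can be bounded with the theorem assumption. So using one or the other inequality depending on whether $|X_{n,j}|$ is larger than $\varepsilon$, we get the bound

$$\left| \phi_{n,j}(t) - \left( 1 - \frac{t^2 \sigma_{n,j}^2}{2} \right) \right| \le \frac{|t|^3 \varepsilon\, E[X_{n,j}^2]}{6} + t^2 E\left[ X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon \right]$$

for any $\varepsilon$ (where the first term comes from the $|X|^3$ term and the second from the $X^2$ term). We can now use the fact that for any $z_i, w_i \in \mathbb{C}$ with modulus at most 1,

$$\left| \prod_i z_i - \prod_i w_i \right| \le \sum_i |z_i - w_i|$$

(by writing a telescoping sum where we switch from $z_1$ to $w_1$, then $z_2$ to $w_2$, and so on). The characteristic function $\phi_{n,j}(t)$ always has modulus at most 1, and

$$\sigma_{n,j}^2 = E[X_{n,j}^2] = E\left[ X_{n,j}^2 ;\ |X_{n,j}| \le \varepsilon \right] + E\left[ X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon \right] \le \varepsilon^2 + E\left[ X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon \right]$$

can be made arbitrarily small by taking $\varepsilon \to 0$ and $n \to \infty$ (by our key assumption). Thus, for sufficiently large $n$ and using sufficiently small $\varepsilon$, the term $\left| 1 - \frac{t^2 \sigma_{n,j}^2}{2} \right|$ is always at most 1. So applying the product inequality above, we find that

$$\left| \prod_{j=1}^{n} \phi_{n,j}(t) - \prod_{j=1}^{n} \left( 1 - \frac{t^2 \sigma_{n,j}^2}{2} \right) \right| \le \frac{\varepsilon |t|^3}{6} \sum_{j=1}^{n} E[X_{n,j}^2] + t^2 \sum_{j=1}^{n} E\left[ X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon \right].$$

For any fixed $t$, both terms tend to 0 as $\varepsilon \to 0$ and $n \to \infty$ (the first term does so because the $\sum_j E[X_{n,j}^2]$ converges to a finite value and then we multiply by $\varepsilon$, and the second term does so by our key assumption), so the left-hand side converges to zero as well. In other words, for any $t$,

$$\lim_{n \to \infty} \phi_{S_n}(t) = \lim_{n \to \infty} \prod_{j=1}^{n} \phi_{n,j}(t) = \lim_{n \to \infty} \prod_{j=1}^{n} \left( 1 - \frac{t^2 \sigma_{n,j}^2}{2} \right).$$

We've shown that $\sigma_{n,j}^2$ can be made arbitrarily small for large enough $n$, which is enough to make the product on the right-hand side converge to $e^{-t^2 \sigma^2 / 2}$ (again this can be checked by taking logs on both sides). This is the characteristic function of the Gaussian $N(0, \sigma^2)$ as desired, so Theorem 124 again yields the desired result.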
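The product inequality used in the proof, $|\prod z_i - \prod w_i| \le \sum |z_i - w_i|$ for complex numbers of modulus at most 1, can be spot-checked on random inputs. This is an illustration with arbitrary sample sizes and a fixed seed; the helper names are not from the lecture.

```python
import random

def product(zs):
    """Product of a list of complex numbers."""
    out = 1 + 0j
    for z in zs:
        out *= z
    return out

def random_in_unit_disc(rng):
    """Rejection sampling of a point in the closed unit disc."""
    while True:
        z = complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
        if abs(z) <= 1:
            return z

rng = random.Random(0)
worst = float("-inf")
for _ in range(1000):
    m = rng.randint(1, 20)
    zs = [random_in_unit_disc(rng) for _ in range(m)]
    ws = [random_in_unit_disc(rng) for _ in range(m)]
    gap = abs(product(zs) - product(ws)) - sum(abs(z - w) for z, w in zip(zs, ws))
    worst = max(worst, gap)
print(worst)   # never positive, up to floating-point rounding
```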


15 October 30, 2019 


Office hours will run from 4-6 pm today, and we can come to see our midterms (for logistical reasons, we won't get 
them back during class). Our grades are on Stellar already — the average is fairly low and the distribution is very 


interesting. The score is out of 40, and when we see our score, we should take it and divide it by 40, and then multiply 
it by 200 to think about how well we did. If we scored a 10 out of 40, this is problematic (and we should talk to
Professor Sun about whether we should stay in the class), but otherwise we should just interpret the score as how well 
we're doing. Moving forward, the class won't change much — the pace and difficulty will be about the same as it has 
been so far, but maybe the problem with the first exam is that we ran out of time. So there might be a time arranged 
outside of class for us to take the exam (at night, for longer than an hour and a half). 

Last time, we talked about the topology of weak convergence (for probability measures on a metric space). Notably, 
we discussed Prohorov's theorem, which was used to prove the Lindeberg-Feller central limit theorem. This last result 


essentially states that if we have a triangular array of independent mean-zero random variables $X_{n,j}$, and we know that $\sum_{j=1}^{n} E[X_{n,j}^2]$ converges to a finite value $\sigma^2$ and that the contribution from large values $\sum_{j=1}^{n} E[X_{n,j}^2 ;\ |X_{n,j}| > \varepsilon]$ goes to zero as $n \to \infty$, then the row sums $\sum_{j=1}^{n} X_{n,j}$ converge in distribution to $N(0, \sigma^2)$. Let's see an example of this in action:


Example 131 


Let $\pi$ be a uniformly random permutation on $[n] = \{1, \ldots, n\}$ (there are $n!$ such possibilities). We'll study the behavior of the number of cycles in $\pi$ as $n$ grows large.

We'll write the permutations in (sorted) cycle notation. For example, having $\pi = (136)(2975)(48)$ indicates that 1 maps to 3 maps to 6 maps to 1, and so on. To ensure uniqueness of representation, we make sure that the smallest number in each cycle appears at the beginning, and we sort the cycles from left to right by minimum element (as we've done above). This is useful because we can now sample $\pi$ sequentially from left to right: start by writing down "(1", and then pick a random integer (uniformly from 1 to $n$) for 1 to go to, say 3. (If it's 1, we instead immediately close the parentheses.) Next, pick another random integer that is not 3 (for example 6). If it's 1, close the parentheses, and otherwise, pick another random integer which is not 3 or 6. Once we return to 1, we close that cycle and start again with the next smallest integer not yet picked, ignoring all of the numbers already in a determined cycle.


This sampling method is convenient, because we can now define the random variables

$$I_{n,k} = 1\{\text{a right parenthesis appears after the } k\text{th number in sorted cycle notation}\}.$$

(For example, $I_{9,k}$ takes on the values $(0, 0, 1, 0, 0, 0, 1, 0, 1)$ in our example permutation $\pi$.) Since the number of right parentheses is the same as the number of cycles, our goal is to determine the behavior of $S_n = \sum_{k=1}^{n} I_{n,k}$. And because of the way we sample, the $I_{n,k}$ are actually independent Bernoulli variables of parameter $\frac{1}{n - k + 1}$, because there are $(k - 1)$ choices that the $k$th number cannot go to, and out of the remaining $n - (k - 1)$ choices, the probability of having a right parenthesis is the probability that we choose the current first entry of the cycle. Additionally, we can see that any permutation will be sampled with probability $\frac{1}{n} \cdot \frac{1}{n - 1} \cdots \frac{1}{1} = \frac{1}{n!}$, so this does indeed yield a uniformly random permutation of $[n]$.
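The sequential sampling procedure translates almost line by line into code. The following is a minimal sketch (the function name `sample_cycles` and the choice to return both the cycles and the indicator sequence are illustrative); at the step where $k$ numbers have been written there are $n - k + 1$ equally likely choices, exactly one of which closes the current cycle.

```python
import random

def sample_cycles(n, rng):
    """Sample a permutation of {1, ..., n} in sorted cycle notation, left to right.

    Returns (cycles, indicators), where indicators[k-1] = 1 exactly when a right
    parenthesis follows the k-th number written down.
    """
    unused = set(range(1, n + 1))
    cycles, indicators = [], []
    while unused:
        start = min(unused)              # smallest number not yet placed opens a cycle
        unused.remove(start)
        cycle = [start]
        while True:
            # the current number maps either back to `start` (closing the cycle)
            # or to one of the numbers not used so far, all equally likely
            choice = rng.choice([start] + sorted(unused))
            if choice == start:
                indicators.append(1)
                break
            indicators.append(0)
            unused.remove(choice)
            cycle.append(choice)
        cycles.append(cycle)
    return cycles, indicators

rng = random.Random(0)
cycles, indicators = sample_cycles(9, rng)
print(cycles, indicators)
```

By construction the number of cycles equals the sum of the indicators, which is the quantity $S_n$ studied in the text.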
Thus, we're in a setting to apply the Lindeberg-Feller central limit theorem to our random variables $I_{n,k}$. The expected number of cycles in $\pi$ is

$$E[S_n] = \sum_{k=1}^{n} \frac{1}{n - k + 1} = \sum_{j=1}^{n} \frac{1}{j} = \log n + O(1),$$

and the variance is (using that a $\mathrm{Ber}(p)$ random variable has variance $p(1 - p)$)

$$\mathrm{Var}(S_n) = \sum_{k=1}^{n} \mathrm{Var}(I_{n,k}) = \sum_{j=1}^{n} \frac{1}{j} \left( 1 - \frac{1}{j} \right) = E[S_n] - \sum_{j=1}^{n} \frac{1}{j^2} = \log n + O(1).$$
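The two harmonic-type sums can be evaluated exactly to see how close the mean and variance of the cycle count are to $\log n$. The function name `cycle_count_moments` is an illustrative choice; the constants in the comments are the standard values of the Euler-Mascheroni constant and $\sum 1/j^2 = \pi^2/6$.

```python
import math

def cycle_count_moments(n):
    """Exact mean and variance of S_n, a sum of independent Ber(1/j) for j = 1..n."""
    mean = sum(1 / j for j in range(1, n + 1))
    var = sum((1 / j) * (1 - 1 / j) for j in range(1, n + 1))
    return mean, var

for n in (10, 1000, 10**6):
    mean, var = cycle_count_moments(n)
    # mean - log n tends to Euler's constant; mean - var tends to pi^2 / 6
    print(n, mean - math.log(n), mean - var, var / math.log(n))
```

The ratio $\mathrm{Var}(S_n) / \log n$ does tend to 1, but slowly, since the correction is a constant divided by $\log n$.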


To apply Lindeberg-Feller, we recenter and rescale our random variables by defining the random variables

$$X_{n,k} = \frac{I_{n,k} - E[I_{n,k}]}{\sqrt{\log n}}.$$

This rescaling makes the first condition of Lindeberg-Feller hold (the sum of the variances of the $X_{n,k}$ is $\frac{1}{\log n} \mathrm{Var}(S_n)$, which goes to 1 as $n \to \infty$). The second condition is pretty easy to check too, because each term of the expression

$$\sum_{k=1}^{n} E\left[ \left( \frac{I_{n,k} - E[I_{n,k}]}{\sqrt{\log n}} \right)^2 ;\ \left| \frac{I_{n,k} - E[I_{n,k}]}{\sqrt{\log n}} \right| > \varepsilon \right]$$

is just equal to 0 for all $n$ large enough (the numerator $I_{n,k} - E[I_{n,k}]$ is at most 1 in absolute value, so multiplying it by $\frac{1}{\sqrt{\log n}}$ makes it smaller than $\varepsilon$ almost surely for sufficiently large $n$). Thus Lindeberg-Feller applies, and we get the result

$$\sum_{k=1}^{n} \frac{I_{n,k} - E[I_{n,k}]}{\sqrt{\log n}} \xrightarrow{d} N(0, 1).$$

Substituting in $S_n = \sum_{k=1}^{n} I_{n,k}$ and $\sum_{k=1}^{n} E[I_{n,k}] = \log n + O(1)$, we thus find that

$$\frac{S_n - \log n}{\sqrt{\log n}} \xrightarrow{d} N(0, 1),$$

and we've seen an example in action where the central limit theorem can be applied even though the random variables are not identically distributed.


However, there are some settings in which sums of random variables do not converge to a Gaussian: 


Example 132 


Suppose the random variables $I_{n,k}$ (for $1 \le k \le n$) are independent and all distributed according to $\mathrm{Ber}\left(\frac{\lambda}{n}\right)$ for some constant $\lambda$, and we want to study the behavior of $S_n = \sum_{k=1}^{n} I_{n,k}$ as $n$ grows large.

This time, $S_n \sim \mathrm{Bin}\left(n, \frac{\lambda}{n}\right)$ has expectation $\lambda$ by linearity of expectation, so the probability that $S_n$ is extremely large (like $100\lambda$) is pretty small. So because most of the mass is supported on a constant range ($\lambda$ is a constant not depending on $n$), but $S_n$ is only supported on the nonnegative integers, it doesn't really make sense to expect that the distribution will approach that of a Gaussian. We can carry out the calculations to make this more clear: notice that

$$P(S_n = k) = \binom{n}{k} \left( \frac{\lambda}{n} \right)^k \left( 1 - \frac{\lambda}{n} \right)^{n - k} = \frac{\lambda^k}{k!} \cdot \frac{(n)_k}{n^k} \cdot \left( 1 - \frac{\lambda}{n} \right)^{n - k},$$

where $(n)_k$ denotes the falling factorial $\frac{n!}{(n - k)!}$. If we fix $k$ and take $n \to \infty$, the middle term goes to 1, and the $-k$ in the right term's exponent becomes irrelevant, and thus

$$\lim_{n \to \infty} P(S_n = k) = \frac{e^{-\lambda} \lambda^k}{k!}.$$
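The limit just computed can be checked by comparing the two probability mass functions term by term. This is a small sketch with illustrative function names; the value $\lambda = 3$ and the cutoff `k_max = 40` are arbitrary, and `n` is assumed to be at least `k_max`.

```python
import math

def binom_pmf(n, p, k):
    """P(Bin(n, p) = k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """P(Pois(lam) = k)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, lam, k_max=40):
    """Largest pointwise gap between Bin(n, lam/n) and Pois(lam) over k = 0..k_max."""
    return max(abs(binom_pmf(n, lam / n, k) - poisson_pmf(lam, k))
               for k in range(k_max + 1))

for n in (100, 10**3, 10**4):
    print(n, max_pmf_gap(n, 3.0))
```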


So in this case, $S_n$ actually converges in law to the Poisson distribution $\mathrm{Pois}(\lambda)$. Again, this is supported on the integers, so it's definitely not Gaussian. However, if we take $\lambda$ large enough, we do recover the Gaussian limit: in other words, $\frac{\mathrm{Pois}(\lambda) - \lambda}{\sqrt{\lambda}} \xrightarrow{d} N(0, 1)$ as $\lambda \to \infty$. Another way of saying this is that if we consider the random variable $\frac{\mathrm{Bin}(n, p) - np}{\sqrt{np(1 - p)}}$, which is a random variable with mean 0 and variance 1, then (by direct calculation or by the central limit theorem), it will converge to a standard normal if we take $p$ fixed and $n$ large, but it will converge to a (recentered and rescaled) Poisson distribution if we take $p = \frac{\lambda}{n}$ and $n$ large. Let's verify that we do get the correct limit behavior as we take $\lambda \to \infty$:


Proposition 133 


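The Poisson distribution $\mathrm{Pois}(\lambda)$, recentered by its mean $\lambda$ and rescaled by $\sqrt{\lambda}$, converges in distribution to a standard Gaussian as $\lambda \to \infty$.

Proof. We'll make use of characteristic functions. If $Y \sim \mathrm{Pois}(\lambda)$, then the characteristic function of $Y$ is

$$\phi(t) = E[e^{itY}] = \sum_{k \ge 0} e^{itk}\, \frac{e^{-\lambda} \lambda^k}{k!} = e^{-\lambda} \sum_{k \ge 0} \frac{(\lambda e^{it})^k}{k!} = \exp(-\lambda(1 - e^{it})).$$

Since $Y$ has mean and variance $\lambda$, we can define a normalized version $Z = \frac{Y - \lambda}{\sqrt{\lambda}}$. The characteristic function for $Z$ is then

$$\psi(t) = E\left[ \exp\left( it \left( \frac{Y - \lambda}{\sqrt{\lambda}} \right) \right) \right] = \exp(-it\sqrt{\lambda})\, E\left[ e^{itY / \sqrt{\lambda}} \right].$$

Plugging this into the form of $\phi$ above, we thus find that

$$\psi(t) = \exp\left( -\lambda \left( 1 - e^{it / \sqrt{\lambda}} \right) - it\sqrt{\lambda} \right).$$

But as $\lambda$ gets large, we can do a series expansion, and we'll find that

$$\psi(t) = \exp\left( \lambda \left( \frac{it}{\sqrt{\lambda}} - \frac{t^2}{2\lambda} + O\left( \lambda^{-3/2} \right) \right) - it\sqrt{\lambda} \right),$$

which converges pointwise to $e^{-t^2/2}$ as $\lambda \to \infty$. Thus, the usual continuity theorem tells us that the rescaled Poisson random variables do indeed converge to a standard normal.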
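The closed form for $\psi$ makes the convergence easy to observe numerically. A minimal sketch, with illustrative function names and an arbitrary test point $t = 1.3$:

```python
import cmath
import math

def psi(t, lam):
    """Characteristic function of (Y - lam) / sqrt(lam) for Y ~ Pois(lam)."""
    s = t / math.sqrt(lam)
    return cmath.exp(lam * (cmath.exp(1j * s) - 1) - 1j * t * math.sqrt(lam))

def gap(t, lam):
    """Distance between psi(t) and the standard Gaussian value exp(-t^2 / 2)."""
    return abs(psi(t, lam) - math.exp(-t * t / 2))

for lam in (1.0, 1e2, 1e4, 1e6):
    print(lam, gap(1.3, lam))
```

The gap shrinks like $\lambda^{-1/2}$, matching the size of the first neglected term in the series expansion.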


Professor Sun once taught a class similar to 18.600, so she feels a little silly going over this next topic, but this 


next part may be useful because apparently many of us haven't taken an intro probability class before. 


Example 134 


We're going to review all of the standard probability distributions. Consider Bernoulli trials, which means that we have random variables $I_k$ distributed iid according to $\mathrm{Ber}(p)$ for all $k \ge 1$.

We'll say that $I_k = 1$ (which occurs with probability $p$) is a "success" and $I_k = 0$ (which occurs with probability $1 - p$) is a "failure." Then the number of successes $B_m = \sum_{k=1}^{m} I_k$ after time $m$ (that is, $m$ trials) is distributed according to the binomial distribution $\mathrm{Bin}(m, p)$, with

$$P(B_m = k) = \binom{m}{k} p^k (1 - p)^{m - k}$$

for any $0 \le k \le m$. Meanwhile, the time $G$ of the first success, or equivalently the first index $k$ such that $I_k = 1$, is distributed according to the geometric distribution $\mathrm{Geo}(p)$, with

$$P(G = k) = (1 - p)^{k - 1} p.$$

As an extension of this, the time of the $r$th success $X_r$ is distributed as $X_r \stackrel{d}{=} G_1 + \cdots + G_r$ where $G, G_i$ are iid $\mathrm{Geo}(p)$ random variables. (In other words, the negative binomial distribution is the geometric distribution convolved with itself $r$ times.) $X_r$ is then distributed according to the negative binomial distribution $\mathrm{NegBin}(r, p)$, with

$$P(X_r = t) = \binom{t - 1}{r - 1} p^r (1 - p)^{t - r}$$

for any $t \ge r$ (because we need exactly $(r - 1)$ successes in the first $(t - 1)$ trials, followed by a success).
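Next, suppose that we scale $p = \frac{1}{n}$, so that the vast majority of our experiments are failures. If we also scale time by $n$ (so that we can see successes at a reasonable rate instead of very rarely), then we'll have done $nt$ trials by time $t$. Then as we calculated in Example 132, $B_{nt} \sim \mathrm{Bin}\left(nt, \frac{1}{n}\right) \xrightarrow{d} \mathrm{Pois}(t)$. Meanwhile, the rescaled time of the first success converges to a continuous random variable $E$: we have $\frac{1}{n} \mathrm{Geo}\left(\frac{1}{n}\right) \xrightarrow{d} \mathrm{Exp}$, the exponential distribution, which has density

\[ f_E(t) = 1\{t \ge 0\} e^{-t} \]

(we can prove this with characteristic functions or direct calculations). The rescaled time of the $r$th success then also converges: $\frac{X_r}{n} \sim \frac{1}{n} \mathrm{NegBin}\left(r, \frac{1}{n}\right) \xrightarrow{d} E_1 + \cdots + E_r$, where the $E_i$ are iid exponential. This is the exponential law convolved with itself $r$ times, which gives us the Gamma distribution $\mathrm{Gamma}(r)$. The gamma density is good to remember, and it can be derived by taking the limit of the negative binomial distribution with $p = \frac{1}{n}$:

\[ \mathbb{P}(X_r = t) = \binom{t-1}{r-1} \left(\frac{1}{n}\right)^r \left(1 - \frac{1}{n}\right)^{t-r} = \frac{1}{n} \cdot \frac{1}{(r-1)!} \left( \prod_{j=1}^{r-1} \frac{t-j}{n} \right) \left(1 - \frac{1}{n}\right)^{t-r}. \]

If we now fix $r$ and $z$, set $t = nz$, and take $n \to \infty$, then each factor $\frac{t-j}{n}$ converges to $z$ and $\left(1 - \frac{1}{n}\right)^{t-r}$ converges to $e^{-z}$, so we find that

\[ \mathbb{P}\left( \frac{X_r}{n} \in [z, z + dz] \right) \to \frac{z^{r-1} e^{-z}}{(r-1)!} \, dz \implies f_{E_1 + \cdots + E_r}(t) = 1\{t \ge 0\} \frac{t^{r-1} e^{-t}}{(r-1)!}. \]

For general real numbers $\alpha > 0$ (not necessarily an integer), the $\mathrm{Gamma}(\alpha)$ distribution similarly has density $1\{t \ge 0\} \frac{t^{\alpha - 1} e^{-t}}{\Gamma(\alpha)}$, where $\Gamma(\alpha)$ is the normalizing constant. (In particular, $\Gamma$ is a generalization of the factorial, because $\Gamma(r) = (r - 1)!$ for positive integer $r$.) And the Gamma distribution will approach a Gaussian (after centering and rescaling) if the amount of time passed is large relative to our sampling rate; in other words, we converge to a Gaussian as $\alpha \to \infty$.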


Fact 135 

We showed in Example 131 that the number of cycles $S_n$ of a random permutation of $[n]$ satisfies $\frac{S_n - \log n}{\sqrt{\log n}} \xrightarrow{d} N(0, 1)$. If we let $C_{n,k}$ be the number of cycles of length $k$ in the permutation, then it turns out that the tuple $(C_{n,1}, C_{n,2}, \cdots, C_{n,n})$ converges in distribution to $(Y_k)_{k \ge 1}$, where the $Y_k \sim \mathrm{Pois}\left(\frac{1}{k}\right)$ are independent. The proof is a tiny bit outside the scope of what we've done so far, and we can see Arratia and Tavaré's paper [3] for more details. (The main idea is that a large number of trials for a rare event is approximately Poisson, and the events that two different vertices are both in cycles of length $k$ are close to independent.) For now, we can at least check the expectations $\mathbb{E}[C_{n,k}] = \frac{1}{k}$, because there are (combinatorially) $\frac{n!}{k \, (n-k)!}$ possible cycles of length $k$, and each one occurs with probability $\frac{(n-k)!}{n!}$ (this can be checked by exploring the cycle one element at a time).
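The expectation $\mathbb{E}[C_{n,k}] = \frac{1}{k}$ can be checked exactly for small $n$ by brute force. The sketch below (function names are illustrative) enumerates all permutations of $\{0, \ldots, n-1\}$ and averages the number of cycles of each length using exact rational arithmetic; it is only practical for small $n$.

```python
from fractions import Fraction
from itertools import permutations


def cycle_length_counts(perm):
    """Return {k: number of cycles of length k} for a permutation of range(n)."""
    n = len(perm)
    seen = [False] * n
    counts = {}
    for start in range(n):
        if seen[start]:
            continue
        # Walk the cycle containing `start`, one element at a time.
        length = 0
        j = start
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        counts[length] = counts.get(length, 0) + 1
    return counts


def expected_cycle_counts(n):
    """Exact E[C_{n,k}] for k = 1..n, averaging over all n! permutations."""
    totals = {k: 0 for k in range(1, n + 1)}
    num_perms = 0
    for perm in permutations(range(n)):
        num_perms += 1
        for k, count in cycle_length_counts(perm).items():
            totals[k] += count
    return {k: Fraction(totals[k], num_perms) for k in totals}


if __name__ == "__main__":
    for k, value in expected_cycle_counts(5).items():
        print(f"E[C_(5,{k})] = {value}")
```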


With the remaining time in this lecture, we'll discuss some weak convergence concepts, with a reminder of what 
results and proofs we should know for this class in general. For what follows, let S be a complete separable metric 
space (that is, a Polish space) with Borel sigma-algebra B. The space of probability measures P on S is then also a 


Polish space. This next result gives us useful alternative characterizations of weak convergence: 
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Theorem 136 (Portmanteau theorem) 


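Let $\mu_n$ be a sequence of probability measures on $S$, and let $\mu$ be a probability measure on $S$. Then the following are equivalent:

* $\mu_n \Rightarrow \mu$.

* $\int f \, d\mu_n \to \int f \, d\mu$ for all bounded uniformly continuous $f$.

* $\limsup_n \mu_n(F) \le \mu(F)$ for all closed $F \subseteq S$.

* $\liminf_n \mu_n(G) \ge \mu(G)$ for all open $G \subseteq S$.

* $\lim_n \mu_n(A) = \mu(A)$ for all $A$ with $\mu(\partial A) = 0$.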


There is also a way to convert weak convergence of measures into a statement about convergence of random 


variables: 


Theorem 137 (Skorohod) 
If $\mu_n \Rightarrow \mu$, then there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and measurable mappings $X_n, X : \Omega \to S$ such that $\mathcal{L}_{X_n} = \mu_n$, $\mathcal{L}_X = \mu$, and $X_n$ converges to $X$ almost surely under $\mathbb{P}$.


We should know the statements of these theorems and of Prohorov's theorem (Theorem 122), and we should know a bit more when we're looking at the special case of the real line ($S = \mathbb{R}$). Let's do a bit of review for that: if we have a sequence of measures on the real line, then each $\mu_n$ can be represented by its cdf $F_n$, and $\mu$ can be represented by its cdf $F$. We then say that $F_n \Rightarrow F$ if $F_n(x) \to F(x)$ at all points of continuity of $F$; the portmanteau theorem then implies that $\mu_n \Rightarrow \mu$ if and only if $F_n \Rightarrow F$. From there, tightness in $\mathbb{R}$ is easy to characterize, because the definition is equivalent to requiring that for any $\varepsilon > 0$, $\inf_n \{F_n(x) - F_n(-x)\} \ge 1 - \varepsilon$ for some sufficiently large $x$. So showing Prohorov's theorem for $\mathbb{R}$ means that we just need to extract convergent subsequences of our $F_n$ whose limits don't have mass escaping to infinity, which follows by a diagonalization argument (the Helly selection theorem).

Skorohod’s theorem is somewhat abstract in general, but (just like Prohorov's theorem) it’s simpler over R, and it 


can also be very useful: 


Proof of Theorem 137 for $S = \mathbb{R}$. Represent $\mu$ with its cdf $F$, and first suppose that $F$ is one-to-one. In this case, let $U$ be uniform on $[0, 1]$; we claim that $F^{-1}(U) \sim \mu$. Indeed, for any $x \in \mathbb{R}$, we have

\[ \mathbb{P}(F^{-1}(U) \le x) = \mathbb{P}(U \le F(x)) = F(x) \]

(where the first equality comes from $F$ being monotone and one-to-one, and the second comes from $F(x)$ being between 0 and 1). More generally, if $F$ is not one-to-one, define the function $X(u) = \inf\{y : F(y) \ge u\}$ in place of $F^{-1}(u)$. Then notice that we have $X(u) \le x$ if and only if $F(x) \ge u$: the reverse direction is clear, and for the forward direction, $X(u) \le x$ implies that for all $\varepsilon > 0$ there is some $y < x + \varepsilon$ such that $F(y) \ge u$, so $F(x + \varepsilon) \ge u$ for all $\varepsilon > 0$ and thus $F(x) \ge u$ by right-continuity of $F$. Thus we again have

\[ \mathbb{P}(X(U) \le x) = \mathbb{P}(F(x) \ge U) = F(x). \]

Basically, given any cdf $F$, we can define a "rough inverse" of it using this mapping $X$, so we can construct the desired random variables as follows. Let the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ be $([0,1], \mathcal{B}, \mathrm{Leb})$, and define the random variables $X, X_n : \Omega \to \mathbb{R}$ via

\[ X(\omega) = \inf\{x : F(x) \ge \omega\}, \quad X_n(\omega) = \inf\{x : F_n(x) \ge \omega\}. \]


Weak convergence then implies that $F_n$ converges to $F$ pointwise except where $F$ is discontinuous (which only happens at countably many points because $F$ is monotone). So for all $\omega$ where $X$ is continuous, $X_n(\omega)$ converges to $X(\omega)$. (Here's one way to show that: if $X$ is continuous at $\omega$ with $X(\omega) = x$, then for all $\varepsilon > 0$ there is some $\delta > 0$ such that for all $\omega' \in (\omega - 3\delta, \omega + 3\delta)$, $X(\omega') \in \left(x - \frac{\varepsilon}{2}, x + \frac{\varepsilon}{2}\right)$. Picking some $\varepsilon' \in \left(\frac{\varepsilon}{2}, \varepsilon\right)$ such that $F$ is continuous at both $x - \varepsilon'$ and $x + \varepsilon'$, we must have $F(x - \varepsilon') < \omega - 2\delta$ and $F(x + \varepsilon') > \omega + 2\delta$ (by definition of $X$ and the fact that $F$ is monotone). Because $F_n$ converges to $F$ pointwise at $x \pm \varepsilon'$, for all sufficiently large $n$ we thus have $F_n(x - \varepsilon') < \omega - \delta$ and $F_n(x + \varepsilon') > \omega + \delta$, meaning $X_n(\omega) \in (x - \varepsilon, x + \varepsilon)$ because $\varepsilon' < \varepsilon$. Taking $\varepsilon \to 0$ shows the result.) Since $X$ itself is also monotone, its discontinuity points are countable and thus of measure zero, so $X_n$ converges to $X$ almost surely. Since we've already shown that the laws of $X_n$ and $X$ are $\mu_n$ and $\mu$ respectively, we've proven the desired result.


(The proof of Skorohod’s theorem in general is kind of related to what we've done here, but it is more complicated 


because we don't have the simple mapping using F anymore.) 
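The "rough inverse" $X(u) = \inf\{y : F(y) \ge u\}$ from the proof above is easy to compute for a discrete distribution, where $F$ is a step function and is certainly not one-to-one. The following sketch (the example distribution and all names are hypothetical, chosen only for illustration) builds $F$ and $X$ with exact rational arithmetic, so that the equivalence $X(u) \le x \iff F(x) \ge u$ can be checked directly.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate


def make_cdf(values, probs):
    """cdf F(x) = P(value <= x) of the discrete law with mass probs[i] at values[i]."""
    def F(x):
        return sum((p for v, p in zip(values, probs) if v <= x), Fraction(0))
    return F


def make_generalized_inverse(values, probs):
    """X(u) = inf{y : F(y) >= u} for u in (0, 1]; `values` must be sorted ascending."""
    cumulative = list(accumulate(probs))

    def X(u):
        if not 0 < u <= 1:
            raise ValueError("u must lie in (0, 1]")
        # bisect_left finds the first index i with cumulative[i] >= u.
        return values[bisect_left(cumulative, u)]

    return X


# Hypothetical example: mass 1/4 at -1, mass 1/2 at 0, mass 1/4 at 2.
values = [-1, 0, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
F = make_cdf(values, probs)
X = make_generalized_inverse(values, probs)

if __name__ == "__main__":
    for k in range(1, 9):
        u = Fraction(k, 8)
        print(f"X({u}) = {X(u)}")
```

Feeding a uniform random number into `X` then produces a sample with cdf `F`, which is the one-dimensional construction used in the proof.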


16 November 4, 2019 


Class started with another attendance quiz today: 


Problem 138 


Let $X, X_i$ be iid exponential random variables with density $1\{x \ge 0\} e^{-x} \, dx$. Find a sequence $b_n$ and a random variable $Y$ such that $\left( \max_{1 \le i \le n} X_i \right) - b_n \xrightarrow{d} Y$.


Solution. We can write the distribution function for the maximum of independent random variables in terms of the distribution functions of the individual variables: specifically, for $t \ge 0$,

\[ \mathbb{P}\left( \max_{1 \le i \le n} X_i \le t \right) = \mathbb{P}(X \le t)^n = (1 - e^{-t})^n. \]

Substituting in $t = \log n + x$ to eliminate the dependence on $n$ in the limit, we have

\[ \mathbb{P}\left( \max_{1 \le i \le n} X_i \le \log n + x \right) = \left(1 - \exp(-(\log n + x))\right)^n = \left(1 - \frac{e^{-x}}{n}\right)^n, \]

which converges to $\exp(-e^{-x})$ as $n \to \infty$. Thus, we can take $b_n = \log n$, and $\max_i X_i - b_n$ converges in distribution to a random variable $Y$ with distribution function $\mathbb{P}(Y \le y) = e^{-e^{-y}}$, which is called the Gumbel distribution.
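The convergence in this solution can be checked numerically, since both the finite-$n$ distribution function and the Gumbel limit have closed forms. This is a small sketch with illustrative function names; the finite-$n$ formula is only valid when $\log n + x \ge 0$.

```python
import math


def max_exp_cdf(n: int, x: float) -> float:
    """P(max of n iid Exp(1) variables - log n <= x), assuming log n + x >= 0."""
    return (1.0 - math.exp(-x) / n) ** n


def gumbel_cdf(x: float) -> float:
    """Distribution function of the Gumbel limit."""
    return math.exp(-math.exp(-x))


if __name__ == "__main__":
    for n in (10, 1000, 10**6):
        gap = max(abs(max_exp_cdf(n, x) - gumbel_cdf(x)) for x in (-1.0, 0.0, 1.0, 2.0))
        print(f"n = {n}: largest gap on the test points = {gap:.2e}")
```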


Remark 139. The terms convergence in law and weakly convergent mean the same thing. 


The problem above is an example of an extreme value statistic, where we want to study the largest of a bunch 


of samples. Let’s do another example of this type: 


Example 140 
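Let $Z, Z_i \sim N(0,1)$ be standard Gaussians with density $g(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$. We wish to study the distribution of $\max(Z_1, \cdots, Z_n)$.

Solution. To use a similar strategy as the previous problem, we need to similarly study $\mathbb{P}(Z \le x)$ but for a standard Gaussian. This time, the function

\[ \Psi(x) = \mathbb{P}(Z > x) = \int_x^\infty \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz \]

has no closed form expression, but we can get a nice upper bound for it when $x > 0$ is large, using that $\frac{z}{x} \ge 1$ on the domain of integration:

\[ \Psi(x) \le \frac{1}{\sqrt{2\pi}} \int_x^\infty \frac{z}{x} \exp\left(-\frac{z^2}{2}\right) dz, \]

which we can integrate with a $u$-substitution $u = \frac{z^2}{2}$ to find that

\[ \Psi(x) \le \frac{1}{\sqrt{2\pi} \, x} \exp\left(-\frac{x^2}{2}\right) = \frac{g(x)}{x}. \]

This is only an upper bound, and it's not very useful for small $x$. But we can show that it's a pretty tight bound for large $x$. Write

\[ \sqrt{2\pi} \, \Psi(x) = \int_x^\infty \left(-\frac{1}{z}\right) \cdot \left(-z \exp\left(-\frac{z^2}{2}\right)\right) dz, \]

and now integrate by parts (with $u = -\frac{1}{z}$ and $dv = -z \exp\left(-\frac{z^2}{2}\right) dz$) to simplify to

\[ \sqrt{2\pi} \, \Psi(x) = \left[ -\frac{1}{z} \exp\left(-\frac{z^2}{2}\right) \right]_x^\infty - \int_x^\infty \frac{1}{z^2} \exp\left(-\frac{z^2}{2}\right) dz. \]

The first term evaluates to $\frac{1}{x} \exp\left(-\frac{x^2}{2}\right)$, and the second can be integrated again by parts: letting $u = \frac{1}{z^3}$ and $dv = z \exp\left(-\frac{z^2}{2}\right) dz$, we have

\[ \sqrt{2\pi} \, \Psi(x) = \frac{1}{x} \exp\left(-\frac{x^2}{2}\right) - \frac{1}{x^3} \exp\left(-\frac{x^2}{2}\right) + \int_x^\infty \frac{3}{z^4} \exp\left(-\frac{z^2}{2}\right) dz. \]

We want a lower bound, so we can toss the last term because the integrand is always positive. We thus get the bound

\[ \sqrt{2\pi} \, \Psi(x) \ge \left(\frac{1}{x} - \frac{1}{x^3}\right) \exp\left(-\frac{x^2}{2}\right) \implies \Psi(x) \ge \frac{g(x)}{x} \left(1 - \frac{1}{x^2}\right). \]

This implies that for $x$ large and for $t \in \mathbb{R}$ with $|t| \ll x$, we have the ratio

\[ \frac{\Psi(x + t)}{\Psi(x)} \sim \frac{\frac{1}{x+t} \exp\left(-\frac{(x+t)^2}{2}\right)}{\frac{1}{x} \exp\left(-\frac{x^2}{2}\right)} \sim \exp\left(-xt - \frac{t^2}{2}\right), \]

and the leading order behavior is proportional to $e^{-xt}$. More specifically, we can change variables $t \mapsto \frac{t}{x}$ to get

\[ \lim_{x \to \infty} \frac{\Psi\left(x + \frac{t}{x}\right)}{\Psi(x)} = \exp(-t). \]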


With this, we can go back to our original question: we have $Z_i$ that are iid standard Gaussians, and we want to understand the distribution of $M_n = \max_{1 \le i \le n} Z_i$. Just like in Problem 138,

\[ \mathbb{P}(M_n \le x) = (1 - \Psi(x))^n, \]

and we want to pick $\Psi(x)$ to be approximately $\frac{1}{n}$ to again get rid of the $n$-dependence. So choose $x = b_n$ such that $\Psi(b_n) = \frac{1}{n}$ (this is a deterministic real number for each $n$, because $\Psi$ is continuous and strictly monotone). We then have, as $b_n \to \infty$,

\[ \mathbb{P}\left( M_n \le b_n + \frac{t}{b_n} \right) = \left(1 - \Psi\left(b_n + \frac{t}{b_n}\right)\right)^n = \left(1 - \frac{(1 + o(1)) \, e^{-t}}{n}\right)^n \]

by our calculation above, and just like before this converges to $e^{-e^{-t}}$ as $n \to \infty$! So the only difference from last time is that our variables need to be rescaled:

\[ \left( \max_{1 \le i \le n} Z_i - b_n \right) b_n \xrightarrow{d} \text{Gumbel}. \]


1<i<n 


Remark 141. Not all random variables behave in this way (the result depends on the tail behavior of the distribution for $X$), but there is a large class of variables for which we get convergence to the Gumbel distribution. (And that's why the Gumbel distribution is called an extreme value distribution.)


We can also say a bit more about the values of $b_n$ we are picking for the standard Gaussian. If we want $\Psi(b_n) = \frac{1}{n}$, then the rough behavior we're looking for is that

\[ \frac{1}{n} \approx \frac{g(b_n)}{b_n} = \frac{1}{\sqrt{2\pi} \, b_n} \exp\left(-\frac{b_n^2}{2}\right). \]

Based on just this equation, it looks like we want $b_n^2 = 2 \log n + (\text{lower order terms})$, and the next correction should account for the $b_n$ in the denominator. Taking logs on both sides of the previous equation, we find that

\[ \frac{b_n^2}{2} = \log n - \frac{1}{2} \log\log n + O(1) \implies b_n = \sqrt{2 \log n} \left(1 - \frac{\log\log n + O(1)}{4 \log n}\right). \]

And the estimate of $\Psi(b_n) \approx \frac{g(b_n)}{b_n}$ is good enough, because adjusting by the factor of $\left(1 - \frac{1}{b_n^2}\right)$ only changes $b_n$ by a multiplicative factor of $1 + O\left(\frac{1}{(\log n)^2}\right)$.
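To get a feel for these estimates, we can solve $\Psi(b_n) = \frac{1}{n}$ numerically and compare against the two-sided bound $\frac{g(x)}{x}\left(1 - \frac{1}{x^2}\right) \le \Psi(x) \le \frac{g(x)}{x}$ and the leading-order formula for $b_n$. This is a sketch under stated assumptions: it uses the standard identity $\Psi(x) = \frac{1}{2} \operatorname{erfc}(x / \sqrt{2})$, a simple bisection on $[0, 40]$, and it drops the unspecified $O(1)$ term in the expansion, so the last function is only a rough approximation.

```python
import math


def gaussian_tail(x: float) -> float:
    """Psi(x) = P(Z > x) for a standard Gaussian Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def tail_upper_bound(x: float) -> float:
    """Upper bound g(x) / x, valid for x > 0."""
    return math.exp(-x * x / 2) / (math.sqrt(2 * math.pi) * x)


def tail_lower_bound(x: float) -> float:
    """Lower bound (g(x) / x) * (1 - 1 / x^2), valid for x > 0."""
    return tail_upper_bound(x) * (1 - 1 / x**2)


def solve_b(n: int, iterations: int = 200) -> float:
    """Bisection for the b >= 0 with Psi(b) = 1/n (requires n >= 2)."""
    lo, hi = 0.0, 40.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if gaussian_tail(mid) > 1.0 / n:
            lo = mid  # tail still too heavy, so b lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def leading_order_b(n: int) -> float:
    """sqrt(2 log n) * (1 - log log n / (4 log n)), with the O(1) term dropped."""
    log_n = math.log(n)
    return math.sqrt(2 * log_n) * (1 - math.log(log_n) / (4 * log_n))


if __name__ == "__main__":
    for n in (10**3, 10**6, 10**12):
        print(n, solve_b(n), leading_order_b(n))
```

The agreement between `solve_b` and `leading_order_b` improves only slowly in $n$, which reflects the $\frac{O(1)}{\log n}$ size of the neglected correction.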

We'll do some more examples of weak convergence on our homework — it’s useful in general to be able to show 
convergence of a sequence of random variables, so there will definitely be something about that on our next exam. 
But for now, we'll move on and spend a bit of time talking about weak convergence on $\mathbb{R}^d$. (We should also read


section 3.10 in the textbook, but we'll go over some of the main points here.) 


Definition 142 
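Let $\mu$ be a probability measure on $\mathbb{R}^d$. The generalized cdf of $\mu$ is defined by $F(x) = \mu\left( \prod_{i=1}^d (-\infty, x_i] \right)$ for $x = (x_1, \cdots, x_d) \in \mathbb{R}^d$, and the characteristic function of $\mu$ is defined by $\chi_\mu(t) = \int \exp(i \langle t, x \rangle) \, d\mu(x)$ for $t \in \mathbb{R}^d$.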


We showed on our homework that $F$ is monotone and right-continuous, and we also found that the measure of any $d$-dimensional rectangle can be found by an inclusion-exclusion argument as

\[ \mu\left( \prod_{i=1}^d (a_i, b_i] \right) = \sum_{v \in V} (-1)^{\#\{i \, : \, v_i = a_i\}} F(v), \]

where $V$ is the set of vertices $\{a_1, b_1\} \times \cdots \times \{a_d, b_d\}$. Then it turns out that $\mu_n \Rightarrow \mu$ (with the general definition from Definition 114) if and only if $F_n$ converges weakly to $F$, meaning that $F_n(x) \to F(x)$ for all $x$ where $F$ is continuous. And for the characteristic function, notice that the integrand still takes values on the unit circle, so $\chi_\mu(t)$ still takes values inside the unit disk. We do have a Fourier inversion theorem in higher dimensions as well, but it's easier to state if we assume the boundary has zero measure:
Theorem 143 (Fourier inversion for probability measures on $\mathbb{R}^d$)


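Let $\mu$ be a probability measure on $\mathbb{R}^d$ with characteristic function $\varphi$, and suppose we have $A = \prod_{j=1}^d [a_j, b_j]$ with $\mu(\partial A) = 0$. Then

\[ \mu(A) = \lim_{T \to \infty} \frac{1}{(2\pi)^d} \int_{[-T, T]^d} \prod_{j=1}^d \frac{e^{-i t_j a_j} - e^{-i t_j b_j}}{i t_j} \, \varphi(t) \, dt. \]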


So again knowing $\varphi$ determines $\mu$ even in higher dimensions (because we can find a dense set of reals to pick our endpoints $a_j, b_j$ from such that $\mu(\partial A)$ is always zero). With this, the one-dimensional continuity theorem also generalizes directly with a similar proof (showing tightness):


Theorem 144 (Continuity theorem in higher dimensions) 


Let $\mu_n$ be probability measures on $\mathbb{R}^d$ with characteristic functions $\varphi_n$. Then

* If $\mu_n \Rightarrow \mu$, then $\varphi_n$ converge pointwise to $\varphi_\mu$.

* If $\varphi_n$ converge pointwise to $\varphi$ and $\varphi$ is continuous at $t = 0$, then $\varphi$ is the characteristic function of some $\mu$ and $\mu_n \Rightarrow \mu$.


Next, we can relate convergence in distribution in higher dimensions to that in one dimension: 


Theorem 145 (Cramér—Wold) 


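If $X_n, X$ are $\mathbb{R}^d$-valued random variables, then $X_n \xrightarrow{d} X$ if and only if the one-dimensional distributions converge, meaning that $\langle \theta, X_n \rangle \xrightarrow{d} \langle \theta, X \rangle$ for all $\theta \in \mathbb{R}^d$.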


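Proof. First suppose we know that $\langle \theta, X_n \rangle \xrightarrow{d} \langle \theta, X \rangle$ for all $\theta$. Because $f(x) = e^{ix}$ is a bounded continuous function, we know that (applying the definition of convergence in distribution to the random variable $\langle \theta, X_n \rangle$)

\[ \mathbb{E}[\exp(i \langle \theta, X_n \rangle)] \to \mathbb{E}[\exp(i \langle \theta, X \rangle)]. \]

But now treating $\theta$ as a free parameter, we see that the characteristic functions $\varphi_n$ of $X_n$ converge pointwise to the characteristic function $\varphi$ of $X$, and $\mathbb{E}[\exp(i \langle \theta, X \rangle)]$ is continuous in $\theta$ (by the same argument as Proposition 106), so by the continuity theorem we indeed have $X_n \xrightarrow{d} X$, as desired.

Similarly for the other direction, suppose $X_n \xrightarrow{d} X$. The distribution of $\langle \theta, X_n \rangle$ on $\mathbb{R}$ is the pushforward measure of $\mu_n$ under the mapping $f(x) = \langle \theta, x \rangle$, so the change of variables formula tells us that the characteristic function of $\langle \theta, X_n \rangle$ is

\[ \chi_{\langle \theta, X_n \rangle}(t) = \int e^{its} \, d(\mu_n \circ f^{-1})(s) = \int e^{it \langle \theta, x \rangle} \, d\mu_n(x). \]

But the right-hand side is the characteristic function of $X_n$ evaluated at $t\theta$, and we have pointwise convergence $\chi_{X_n}(t\theta) \to \chi_X(t\theta)$ for any $t$ because $X_n \xrightarrow{d} X$. Thus we also have pointwise convergence of the characteristic functions $\chi_{\langle \theta, X_n \rangle}$ to $\chi_{\langle \theta, X \rangle}$, and thus $\langle \theta, X_n \rangle \xrightarrow{d} \langle \theta, X \rangle$.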
Corollary 146 (Central limit theorem for iid sequences in $\mathbb{R}^d$)


Let $X, X_i$ be iid $\mathbb{R}^d$-valued random variables such that $\mathbb{E}(\|X\|^2)$ is finite. Define the mean vector and covariance matrix of $X$ as

\[ \mu = \mathbb{E}[X] \in \mathbb{R}^d, \quad \Sigma = \mathrm{Cov}(X) = \mathbb{E}\left[(X - \mu)(X - \mu)^T\right] \in \mathbb{R}^{d \times d}. \]

(Unpacking the notation, $\Sigma = \mathrm{Cov}(X)$ is the $d \times d$ matrix whose $(i, j)$ entry is the covariance between coordinates $i$ and $j$ of $X$.) Then we have $\frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) \xrightarrow{d} N(0, \Sigma)$ (that is, convergence to the multivariate normal distribution).


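Proof. Much like in previous proofs, we're going to show that the characteristic functions converge, so let's start by figuring out the characteristic function for the multivariate Gaussian. Because $\Sigma$ is a $d \times d$ positive semidefinite matrix, it has a Cholesky decomposition $\Sigma = AA^T$ for some $A \in \mathbb{R}^{d \times k}$. Then having $Y \sim N(0, \Sigma)$ is equivalent to saying that $Y \overset{d}{=} AZ$, where $Z$ is a standard multivariate Gaussian $N(0, I_{k \times k})$ (by plugging into the definition of the covariance matrix). We thus find that the characteristic function for $Y$ is

\[ \varphi_\Sigma(\theta) = \mathbb{E}[\exp(i \langle \theta, Y \rangle)] = \mathbb{E}[\exp(i \langle \theta, AZ \rangle)] = \mathbb{E}\left[\exp\left(i \langle A^T \theta, Z \rangle\right)\right]. \]

But by definition, we know that $Z = (Z_1, Z_2, \cdots, Z_k)$ with all entries iid standard Gaussian, so the above expression simplifies to (here using that $c_1 Z_1 + \cdots + c_k Z_k$ has the same distribution as $\sqrt{c_1^2 + \cdots + c_k^2} \, Z'$ for a standard Gaussian $Z'$ and that the characteristic function for the standard Gaussian is $\mathbb{E}[e^{itZ'}] = e^{-t^2/2}$)

\[ \varphi_\Sigma(\theta) = \exp\left(-\frac{\|A^T \theta\|^2}{2}\right) = \exp\left(-\frac{\theta^T A A^T \theta}{2}\right) = \exp\left(-\frac{\theta^T \Sigma \theta}{2}\right). \]

Our goal is to show that the left-hand side's characteristic function converges to this function $\varphi_\Sigma(\theta)$, meaning that it suffices to show for all $\theta \in \mathbb{R}^d$ that

\[ \mathbb{E}\left[\exp\left(i \left\langle \theta, \frac{1}{\sqrt{n}} \sum_{j=1}^n (X_j - \mu) \right\rangle\right)\right] \to \varphi_\Sigma(\theta). \]

But we know by the one-dimensional central limit theorem that

\[ \frac{1}{\sqrt{n}} \sum_{j=1}^n \langle \theta, X_j - \mu \rangle \xrightarrow{d} N(0, \mathrm{Var}(\langle \theta, X \rangle)) = N(0, \theta^T \Sigma \theta) \]

(the last equality can be verified by expanding out the formula for variance). Thus we can take the characteristic functions of those variables (evaluated at 1) and find that for any fixed $\theta$,

\[ \mathbb{E}\left[\exp\left(\frac{i}{\sqrt{n}} \sum_{j=1}^n \langle \theta, X_j - \mu \rangle\right)\right] \to \exp\left(-\frac{\theta^T \Sigma \theta}{2}\right), \]

and thus we have convergence of the characteristic functions and therefore convergence to the desired distribution.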
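For completeness, here is the variance computation that the proof above leaves as a check. It uses only the definitions of $\mu$ and $\Sigma$ from the statement of the corollary and the linearity of expectation:

```latex
\begin{align*}
\operatorname{Var}(\langle \theta, X \rangle)
  &= \mathbb{E}\left[ \left( \langle \theta, X \rangle - \langle \theta, \mu \rangle \right)^2 \right]
   = \mathbb{E}\left[ \left( \theta^T (X - \mu) \right)^2 \right] \\
  &= \mathbb{E}\left[ \theta^T (X - \mu)(X - \mu)^T \theta \right]
   = \theta^T \, \mathbb{E}\left[ (X - \mu)(X - \mu)^T \right] \theta
   = \theta^T \Sigma \theta .
\end{align*}
```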


This concludes the content that we're covering from chapter 3 of Durrett, and now we're going to move on to our 


next topic, martingales and conditional expectation. We'll start with a motivating discussion: 
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Definition 147 
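Suppose we're on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and suppose we have two events $A, B \in \mathcal{F}$ such that $\mathbb{P}(A) > 0$. Then we can define the conditional probability $\mathbb{P}(B|A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}$. (In particular, $\mathbb{P}(B|A) = \mathbb{P}(B)$ if $A, B$ are independent.) If $\mathbb{P}(A) > 0$ and $X \in L^1$ is an integrable random variable, we can also define the conditional expectation $\mathbb{E}(X|A) = \frac{\mathbb{E}(X; A)}{\mathbb{P}(A)}$.

Next, suppose $X, Y$ are both random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. For any value $y$ with $\mathbb{P}(Y = y) > 0$, we can define $g(y) = \mathbb{E}[X | Y = y] = \frac{\mathbb{E}[X; Y = y]}{\mathbb{P}(Y = y)}$. Then if $Y$ satisfies $\mathbb{P}(Y = y) > 0$ for all $y \in \operatorname{supp} \mathcal{L}_Y$, then we can define the conditional expectation (random variable) $\mathbb{E}[X|Y] = g(Y)$.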


Because of the division by $\mathbb{P}(A)$ and $\mathbb{P}(Y = y)$ above, it doesn't directly make sense to condition on events of probability zero. However, there are often situations where we do want to condition on an event of that sort. For example, suppose we have random variables distributed as

\[ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N(0, \Sigma) = N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right), \]

where $\det \Sigma > 0$. In general, a Gaussian of the form $N(0, \Sigma = AA^T)$ doesn't have a density on $\mathbb{R}^d$ unless $A$ is actually an invertible $d \times d$ matrix. But if we assume that $\det \Sigma > 0$ for the matrix above, then $\mathcal{L}_{(X,Y)}$ will be a measure on $\mathbb{R}^2$ with some density $g_\Sigma$ (the details can be worked out using a calculus argument). But the point is that even though the probability that $Y$ takes on any particular value is zero, we would still like to be able to say that

\[ \mathbb{E}[X | Y = y] = \frac{\int x \, g_\Sigma(x, y) \, dx}{\int g_\Sigma(x, y) \, dx}. \]


So the ordinary notion of conditioning from an introductory probability class isn't quite enough, and we'll discuss this 


more next time! 


17 November 6, 2019 


Our second midterm exam will be of similar difficulty to the first one (and weighted equally). To give us more time to 
finish the exam, it will be scheduled between 5—9pm on December 4th, depending on our availability. 


Today, we'll discuss conditional expectation and martingales. Recall that if $X$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ and $\mathbb{E}|X| < \infty$, then we can define $\mathbb{E}[X|A] = \frac{\mathbb{E}[X; A]}{\mathbb{P}(A)}$ for any event $A$ with positive probability. In particular, if $Y$ is another random variable with $\mathbb{P}(Y = y)$ positive, we can condition on the event $Y = y$ and write $\mathbb{E}[X | Y = y] = \frac{\mathbb{E}[X; Y = y]}{\mathbb{P}(Y = y)}$.

However, this doesn't really work as generally as we'd like, because we want to define $\mathbb{E}[X|Y]$ even when $\mathbb{P}(Y = y) = 0$, particularly when $Y$ is completely nonatomic. So the key idea is that it doesn't make sense to condition on events of measure zero, so we shouldn't think about individual events $\{Y = y\}$. Instead, we should think about $Y$ as a whole and try to define $\mathbb{E}[X|Y]$ as a random variable. This random variable should be $\sigma(Y)$-measurable (because it should depend only on the value of $Y$), and it should satisfy the key property that

\[ \mathbb{E}[\mathbb{E}[X|Y] \, h(Y)] = \mathbb{E}[X h(Y)] \]

for any measurable function $h$, as long as both sides are well-defined. Let's formalize this now:
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Definition 148 
Suppose $X$ is a random variable on $(\Omega, \mathcal{F}, \mathbb{P})$ with $\mathbb{E}[|X|] < \infty$, and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. We say that $Y$ is a version of $\mathbb{E}[X|\mathcal{G}]$ if $Y \in \mathcal{G}$ (that is, $Y$ is measurable with respect to $\mathcal{G}$) and $\mathbb{E}[X; A] = \mathbb{E}[Y; A]$ for all $A \in \mathcal{G}$.

It's an exercise for us to reconcile this definition with the "conditional probability" that we learned in introductory probability. We need to show that this random variable $Y$ actually exists and that it's unique in some way. Also, the notation suggests that $Y$ should be integrable, so we should check that as well. But eventually, we will use this to define conditional expectation via $\mathbb{E}(X|Y) = \mathbb{E}(X|\sigma(Y))$.


Lemma 149 (Integrability) 


If $Y$ is a version of $\mathbb{E}[X|\mathcal{G}]$, then $\mathbb{E}[|Y|] \le \mathbb{E}[|X|] < \infty$ (so $Y$ is integrable).


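Proof. Because $Y$ is $\mathcal{G}$-measurable, $\{Y > 0\}$ is in $\mathcal{G}$, meaning that $\mathbb{E}[Y_+] = \mathbb{E}[Y; Y > 0] = \mathbb{E}[X; Y > 0]$. Similarly, $\mathbb{E}[Y_-] = \mathbb{E}[-Y; Y < 0] = \mathbb{E}[-X; Y < 0]$. So now

\[ \mathbb{E}[|Y|] = \mathbb{E}[Y_+] + \mathbb{E}[Y_-] = \mathbb{E}[X; Y > 0] + \mathbb{E}[-X; Y < 0] \le \mathbb{E}[|X|; Y > 0] + \mathbb{E}[|X|; Y < 0] \le \mathbb{E}[|X|], \]

as desired.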


Lemma 150 (Uniqueness) 


If Y and Y’ are both versions of E[X|G], then Y = Y’ almost surely. 


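Proof. By definition, both random variables are $\mathcal{G}$-measurable, so define the event $A_\varepsilon = \{Y - Y' \ge \varepsilon\}$ (which is also in $\mathcal{G}$) for any $\varepsilon > 0$. We have that

\[ 0 = \mathbb{E}[Y; A_\varepsilon] - \mathbb{E}[Y'; A_\varepsilon] = \mathbb{E}[Y - Y'; A_\varepsilon] \ge \varepsilon \, \mathbb{P}(A_\varepsilon), \]

so $\mathbb{P}(A_\varepsilon) = 0$. Taking $\varepsilon \to 0$, we conclude that $Y \le Y'$ almost surely, and flipping the roles of $Y$ and $Y'$ shows that $Y' \le Y$ almost surely as well. Thus $Y = Y'$ almost surely.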


Definition 151 
Let $\mu$ and $\nu$ be measures on $(\Omega, \mathcal{F})$. We say that $\nu$ is absolutely continuous with respect to $\mu$ (denoted $\nu \ll \mu$) if for all $A \in \mathcal{F}$ with $\mu(A) = 0$, we also have $\nu(A) = 0$.


Theorem 152 (Radon-Nikodym) 
Suppose $\mu, \nu$ are $\sigma$-finite measures on $(\Omega, \mathcal{F})$ with $\nu \ll \mu$. Then there is an $\mathcal{F}$-measurable function $f$ such that $\nu(A) = \int_A f \, d\mu$ for all $A \in \mathcal{F}$. (The function $f$ is often written as $\frac{d\nu}{d\mu}$.)


This result basically says that if $\nu$ is absolutely continuous with respect to $\mu$, then $\nu$ has a density with respect to $\mu$. Recall that we had a similar situation in the proof of Cramér's theorem, in which we had an exponential tilting with $\frac{d\mathbb{P}_\theta}{d\mathbb{P}} = \frac{\exp(\theta X)}{\mathbb{E}[\exp(\theta X)]}$ (here $\mathbb{P}_\theta$ was absolutely continuous with respect to $\mathbb{P}$).


We won't prove Radon-Nikodym right now, but we will return to it later in the course (see Theorem 215). For 


now, we will use it to show existence of the conditional expectation. 
Proof of existence of conditional expectation. As in the definition, let $X$ be an integrable random variable on $(\Omega, \mathcal{F}, \mathbb{P})$,
and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Then the restriction of $\mathbb{P}$ to $\mathcal{G}$, which we write as $\mu = \mathbb{P}|_{\mathcal{G}}$, is a probability measure on $(\Omega, \mathcal{G})$. We can now define a measure $\nu$ on $(\Omega, \mathcal{G})$ by setting

\[ \nu(A) = \mathbb{E}(X_+; A) \quad \text{for all } A \in \mathcal{G}. \]

This is a finite measure, because $\nu(\Omega) = \mathbb{E}[X_+] < \infty$ by assumption. Also, $\nu$ is absolutely continuous with respect to $\mu$, because if $A$ is an event of probability zero, then $\mathbb{E}(X_+; A) = 0$. Thus the Radon-Nikodym theorem tells us that there is a $\mathcal{G}$-measurable function $Y = \frac{d\nu}{d\mu}$ such that $\nu(A) = \int_A Y \, d\mu$ for all $A \in \mathcal{G}$.

But now notice that the left-hand side is $\nu(A) = \mathbb{E}[X_+; A]$, and the right-hand side is $\int_A Y \, d\mu = \mathbb{E}[Y; A]$. Thus, $Y$ is a version of $\mathbb{E}[X_+|\mathcal{G}]$. Similarly constructing a version of $\mathbb{E}[X_-|\mathcal{G}]$ and then subtracting the two random variables gives us the desired conditional expectation.


We should read the textbook for some basic properties of the conditional expectation: in particular, it's linear and monotone, it satisfies a version of Jensen's inequality, and so on. But there are a few important properties which are not analogous to anything for the ordinary expectation:


Proposition 153 


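Let $\mathcal{G}^{\text{small}}$ be a sub-$\sigma$-algebra of $\mathcal{G}$ (which is a sub-$\sigma$-algebra of $\mathcal{F}$). If $\mathbb{E}[X|\mathcal{G}] \in \mathcal{G}^{\text{small}}$, then $\mathbb{E}[X|\mathcal{G}] = \mathbb{E}[X|\mathcal{G}^{\text{small}}]$.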


Proof. Let $Y = \mathbb{E}[X|\mathcal{G}]$; we wish to show that $Y$ is a version of $\mathbb{E}[X|\mathcal{G}^{\text{small}}]$. By assumption, $Y$ is measurable with respect to $\mathcal{G}^{\text{small}}$, so the first condition is satisfied. Also, $\mathbb{E}[Y; A] = \mathbb{E}[X; A]$ for any $A \in \mathcal{G}$ by definition of $Y$, so in particular it holds for all $A \in \mathcal{G}^{\text{small}}$, verifying the second condition.


Proposition 154 (Tower property) 
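Suppose we have sub-$\sigma$-algebras $\mathcal{G}^{\text{small}} \subseteq \mathcal{G} \subseteq \mathcal{F}$ as before. Then

\[ \mathbb{E}[\mathbb{E}[X|\mathcal{G}^{\text{small}}] \,|\, \mathcal{G}] = \mathbb{E}[X|\mathcal{G}^{\text{small}}] = \mathbb{E}[\mathbb{E}[X|\mathcal{G}] \,|\, \mathcal{G}^{\text{small}}]. \]

Proof. The proof is very similar to the one above. For the first equality, let $Y = \mathbb{E}[X|\mathcal{G}^{\text{small}}]$; we wish to show that $Y$ is a version of $\mathbb{E}[Y|\mathcal{G}]$. Indeed, it is measurable with respect to $\mathcal{G}^{\text{small}}$ and thus $\mathcal{G}$, and the other condition is asking us to show that $\mathbb{E}[Y; A] = \mathbb{E}[Y; A]$ for $A \in \mathcal{G}$, which is true. For the second equality, define $Z = \mathbb{E}[X|\mathcal{G}]$; we wish to show that $Y$ is a version of $\mathbb{E}[Z|\mathcal{G}^{\text{small}}]$. Again $Y$ is measurable with respect to $\mathcal{G}^{\text{small}}$, and the other condition asks us to show that $\mathbb{E}[Y; A] = \mathbb{E}[Z; A]$ for any $A \in \mathcal{G}^{\text{small}}$, which is true because both of these are equal to $\mathbb{E}[X; A]$ by the definitions of $Y$ and $Z$.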
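On a finite sample space, a sub-$\sigma$-algebra is generated by a partition, and a version of $\mathbb{E}[X|\mathcal{G}]$ is just the probability-weighted average of $X$ on each block. The sketch below uses a hypothetical example (a fair die with $X(\omega) = \omega$, and two nested partitions standing in for $\mathcal{G}^{\text{small}} \subseteq \mathcal{G}$) to check the defining property $\mathbb{E}[X; A] = \mathbb{E}[Y; A]$ and the tower property; all names are illustrative.

```python
from fractions import Fraction


def conditional_expectation(prob, X, partition):
    """A version of E[X | G], where G is generated by `partition` of a finite space.

    `prob` and `X` are dicts keyed by outcome; the result is a dict outcome -> value.
    """
    Y = {}
    for block in partition:
        mass = sum(prob[w] for w in block)
        if mass == 0:
            average = Fraction(0)  # any value works on a null block
        else:
            average = sum(prob[w] * X[w] for w in block) / mass
        for w in block:
            Y[w] = average
    return Y


def restricted_expectation(prob, X, event):
    """E[X; A] = E[X * 1_A] for an event A given as a set of outcomes."""
    return sum(prob[w] * X[w] for w in event)


# Hypothetical example: a fair die, with two nested partitions.
outcomes = range(1, 7)
prob = {w: Fraction(1, 6) for w in outcomes}
X = {w: Fraction(w) for w in outcomes}
fine = [{1, 2}, {3, 4}, {5, 6}]    # generates G
coarse = [{1, 2, 3, 4}, {5, 6}]    # generates G^small, a sub-sigma-algebra of G

Y = conditional_expectation(prob, X, fine)

if __name__ == "__main__":
    print("E[X | G]:", Y)
    print("E[E[X | G] | G^small]:", conditional_expectation(prob, Y, coarse))
    print("E[X | G^small]:", conditional_expectation(prob, X, coarse))
```

Checking the defining property on the blocks is enough here, because every event in the generated $\sigma$-algebra is a disjoint union of blocks.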


Proposition 155 
Let $X$ and $Y$ be random variables such that $\mathbb{E}[|Y|]$ and $\mathbb{E}[|XY|]$ are finite and $X \in \mathcal{G}$. Then

\[ \mathbb{E}[XY|\mathcal{G}] = X \, \mathbb{E}[Y|\mathcal{G}]. \]

In other words, because $X$ is a known constant with respect to $\mathcal{G}$, we can pull it out of the conditional expectation.

Proof. First, we show that the equality holds if $X = 1_B$ for some $B \in \mathcal{G}$, and we'll do this by showing that the right-hand side is a version of $\mathbb{E}[XY|\mathcal{G}]$. It's measurable because both $X$ and $\mathbb{E}[Y|\mathcal{G}]$ are $\mathcal{G}$-measurable, and the product of two measurable functions is also measurable. For the other condition, we must check that $\mathbb{E}[XY; A] = \mathbb{E}[X \mathbb{E}[Y|\mathcal{G}]; A]$ for all $A \in \mathcal{G}$, and this is true because

\[ \mathbb{E}[XY; A] = \mathbb{E}[1_B Y; A] = \mathbb{E}[Y; A \cap B] = \mathbb{E}[\mathbb{E}[Y|\mathcal{G}]; A \cap B] \]

(in the last step we use that $A \cap B \in \mathcal{G}$ because $A, B \in \mathcal{G}$), which simplifies to $\mathbb{E}[1_B \mathbb{E}[Y|\mathcal{G}]; A] = \mathbb{E}[X \mathbb{E}[Y|\mathcal{G}]; A]$, as desired. Finally, by linearity of conditional expectation, this identity then holds for simple functions and thus (by a limiting argument) for general $\mathcal{G}$-measurable functions $X$.


One last note is that we can sometimes interpret conditional expectation as an orthogonal projection. To set that up, notice that if $\mathcal{G} \subseteq \mathcal{F}$, then the space of random variables $L^2(\mathcal{G})$ is a subspace of $L^2(\mathcal{F})$.


Proposition 156 


For any random variable $X \in L^2(\mathcal{F})$, the conditional expectation $\mathbb{E}[X|\mathcal{G}]$ is the orthogonal projection of $X$ onto $L^2(\mathcal{G})$.


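Proof. First of all, conditional Jensen's inequality tells us that $\mathbb{E}[\mathbb{E}[X|\mathcal{G}]^2] \le \mathbb{E}[\mathbb{E}[X^2|\mathcal{G}]] = \mathbb{E}[X^2]$ (which is finite because $X \in L^2(\mathcal{F})$), so $\mathbb{E}[X|\mathcal{G}]$ is square-integrable and thus indeed in $L^2(\mathcal{G})$. To show orthogonality, we need to show that the difference between $X$ and $\mathbb{E}[X|\mathcal{G}]$ is orthogonal to any element of $L^2(\mathcal{G})$. In other words, we must show that for any $Z \in L^2(\mathcal{G})$, we have $\mathbb{E}[Z(X - \mathbb{E}[X|\mathcal{G}])] = 0$. By linearity of expectation, we can write the left-hand side as $\mathbb{E}[XZ] - \mathbb{E}[Z \mathbb{E}[X|\mathcal{G}]]$. But since $Z \in L^2(\mathcal{G})$ is $\mathcal{G}$-measurable, we have by Proposition 155 and then Proposition 154 that

\[ \mathbb{E}[XZ] - \mathbb{E}[Z \mathbb{E}[X|\mathcal{G}]] = \mathbb{E}[XZ] - \mathbb{E}[\mathbb{E}[XZ|\mathcal{G}]] = \mathbb{E}[XZ] - \mathbb{E}[XZ] = 0, \]

since an ordinary expectation is like conditioning on the trivial $\sigma$-algebra.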


There's one special topic in our textbook about regular probability distributions, which we'll skip for now. Instead, 
we'll move on to martingales, which is a topic with many interesting applications. Throughout the following discussion, 


we're working on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.


Definition 157 


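A filtration is a sequence $(\mathcal{F}_n)_{n \ge 0}$ of $\sigma$-fields such that $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}$. A sequence of random variables $(X_n)_{n \ge 0}$ on $(\Omega, \mathcal{F}, \mathbb{P})$ is adapted to a filtration $(\mathcal{F}_n)_{n \ge 0}$ if $X_n \in \mathcal{F}_n$ for all $n$. The natural filtration of the sequence $(X_n)$ is defined by setting $\mathcal{F}_n = \sigma(X_0, \cdots, X_n)$.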


(Notice that the definition of a filtration doesn't involve any probability — it’s just a statement about families of 


sets.) In words, we “gradually reveal more information” at each step in a filtration. 


Definition 158 
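Let $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots$ be a filtration on $(\Omega, \mathcal{F}, \mathbb{P})$. We say that $(X_n)_{n \ge 0}$ is a martingale with respect to $(\mathcal{F}_n)$ if $X_n$ is adapted to $\mathcal{F}_n$, $\mathbb{E}[|X_n|] < \infty$ for all $n$ (though we do not need to have a uniform bound across all $n$), and $\mathbb{E}[X_{n+1}|\mathcal{F}_n] = X_n$.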


In other words, given everything we know about the random variables up to the nth step, the next step’s expected 


value is equal to the current value. 
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Definition 159 


In the last part of the definition above, $(X_n)_{n \ge 0}$ is a supermartingale if we only have $\mathbb{E}[X_{n+1}|\mathcal{F}_n] \le X_n$, and $(X_n)_{n \ge 0}$ is a submartingale if we only have $\mathbb{E}[X_{n+1}|\mathcal{F}_n] \ge X_n$.

(For notational convenience from here on, we will write that $X_n$ is a martingale instead of $(X_n)$.)


Example 160 

Let $X_n$ be a simple random walk on $\mathbb{Z}$, meaning that we start at $X_0 = 0$ and define $X_n = \sum_{i=1}^n Y_i$, where the $Y_i$ are iid and are each $1$ or $-1$ with probability $\frac{1}{2}$ each. This sequence of random variables is a martingale, because each step has probability $\frac{1}{2}$ of adding 1 to $X_n$ and probability $\frac{1}{2}$ of subtracting 1 (so $\mathbb{E}[X_{n+1}|\mathcal{F}_n] = \frac{1}{2}(X_n + 1) + \frac{1}{2}(X_n - 1) = X_n$).


The first observation we can make about martingales in general is that

\[ \mathbb{E}[X_n] = \mathbb{E}[\mathbb{E}[X_n|\mathcal{F}_{n-1}]] = \mathbb{E}[X_{n-1}], \]

where we've applied the tower property (Proposition 154) and then the martingale property. Applying this iteratively tells us that

\[ \mathbb{E}[X_n] = \mathbb{E}[X_{n-1}] = \cdots = \mathbb{E}[X_0]. \]

But we'll now show that we also have $\mathbb{E}[X_0] = \mathbb{E}[X_\tau]$ for certain random times $\tau$, and this will have many applications.


Definition 161 
Let $\tau$ be a nonnegative integer-valued random variable. We say that $\tau$ is a stopping time with respect to a filtration $(\mathcal{F}_n)_{n \ge 0}$ if $\{\tau = n\} \in \mathcal{F}_n$ for all $n$.


The intuitive reason for the name "stopping time" is that we can think of a martingale as modeling our wealth in a stock market, and we can only choose to stop our random process at time $n$ (and hence decide that we're in the event $\{\tau = n\}$) based on the information that we already know so far (which is the $\sigma$-algebra $\mathcal{F}_n$). The statement $\mathbb{E}[X_0] = \mathbb{E}[X_\tau]$ is then saying that we "cannot make money off of the stock market," but things aren't quite so simple:


Example 162 


Continuing our example from above, let $X_n$ be a simple random walk on $\mathbb{Z}$ started at $X_0 = 0$, and define

\[ \tau = \inf\{n : X_n = 1\}. \]

Then $\tau$ is a valid stopping time, because it's a nonnegative integer-valued random variable, it's finite almost surely (exercise), and at time $n$ the event $\{\tau = n\}$ is measurable with respect to what we already know (it's the event that $X_n = 1$ and $X_k \ne 1$ for all $k < n$). But notice that $0 = \mathbb{E}[X_0] \ne \mathbb{E}[X_\tau] = 1$.


On the other hand, in many situations it does make sense to expect that $\mathbb{E}[X_0] = \mathbb{E}[X_\tau]$. The key observation is that if $X_n$ is a martingale and $\tau$ is a stopping time, then the stopped process defined by

\[ Y_n = X_{n \wedge \tau}, \quad \text{where } n \wedge \tau = \min(n, \tau), \]

is also a martingale. Indeed, $Y_n = X_n$ unless we have "already stopped" (meaning $\tau < n$), so we can write

\[ Y_n = \sum_{i=0}^{n-1} 1\{\tau = i\} X_i + 1\{\tau \ge n\} X_n. \]

We can check that $Y_n$ is integrable for any $n$ (exercise). From there, notice that the first sum is measurable with respect to $\mathcal{F}_{n-1}$, and $1\{\tau \ge n\} = 1 - \sum_{i=0}^{n-1} 1\{\tau = i\}$ is also in $\mathcal{F}_{n-1}$. So conditioning the equation on $\mathcal{F}_{n-1}$, we get

\[ \mathbb{E}[Y_n|\mathcal{F}_{n-1}] = \sum_{i=0}^{n-1} 1\{\tau = i\} X_i + 1\{\tau \ge n\} \, \mathbb{E}[X_n|\mathcal{F}_{n-1}]. \]

But now by the martingale property of $X$, the conditional expectation on the right is $X_{n-1}$, so the entire right-hand side reduces to $Y_{n-1}$ and the stopped process is also a martingale! In particular, the earlier calculation shows that $\mathbb{E}[Y_0] = \mathbb{E}[Y_1] = \mathbb{E}[Y_2] = \cdots$, meaning that

\[ \mathbb{E}[X_{n \wedge \tau}] = \mathbb{E}[Y_n] = \mathbb{E}[Y_0] = \mathbb{E}[X_0]. \]


But $n \wedge \tau \to \tau$ almost surely (in other words, because $\tau$ is finite almost surely, there is some $n(\omega)$ such that
$\tau(\omega) \wedge n = \tau(\omega)$ for all $n \ge n(\omega)$). This means that $X_{n \wedge \tau}$ converges to $X_\tau$ almost surely, so
it's reasonable to expect that with sufficient integrability conditions, we also have convergence of expectations and
thus $\mathbb{E}[X_0] = \mathbb{E}[X_\tau]$. So a large part of this section of the class is figuring out conditions under which we have
$\mathbb{E}[X_{n \wedge \tau}] \to \mathbb{E}[X_\tau]$. We'll continue this discussion next time, but for now we'll describe a nice application:


Example 163

Let $G = (V, E)$ be a finite connected graph, and let $Z \subseteq V$ be its “boundary.” Let $f : Z \to \mathbb{R}$ be a function (a
“boundary condition”), and suppose we perform a simple random walk $X_n$ on the graph $G$ stopped at the boundary
(meaning that we travel along the edges and always move to a uniformly random neighbor of the current vertex at each step). Let $\tau$ be the
first time we hit the boundary $Z$, and define

$$M_n = \mathbb{E}[f(X_\tau) \mid \mathcal{F}_n],$$

where $\mathcal{F}_n$ is the filtration $\sigma(X_0, \cdots, X_n)$ of the walk. Then $M_n$ is a martingale (exercise), and its value depends
only on the current position $X_n$. It turns out that $M_n = h(X_n)$, where $h$ is the harmonic interpolation of $f$ (which has the
property that the value at a non-boundary vertex of the graph is the average of the neighboring values). And as an extension, if
we replace “harmonic function” with “subharmonic function” in $M_n = h(X_n)$, then we'll get a submartingale instead
of a martingale.
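To make the averaging property of Example 163 concrete, here is a minimal numerical sketch. The path graph, the boundary values, and the function name `harmonic_interpolation` are illustrative assumptions rather than anything from the lecture; the code simply averages over neighbors repeatedly until the interior values settle.

```python
def harmonic_interpolation(neighbors, boundary_values, sweeps=2000):
    """Approximate h(v) = E[f(X_tau) | X_0 = v] by repeated neighbor averaging.

    neighbors: dict mapping each vertex to the list of its neighbors.
    boundary_values: dict mapping each boundary vertex to f(vertex).
    """
    # Start from f on the boundary and 0 in the interior (any start works).
    h = {v: float(boundary_values.get(v, 0.0)) for v in neighbors}
    for _ in range(sweeps):
        for v in neighbors:
            if v not in boundary_values:
                # Harmonic condition: value equals the average over neighbors.
                h[v] = sum(h[u] for u in neighbors[v]) / len(neighbors[v])
    return h

# Hypothetical example: path graph 0 - 1 - 2 - 3 - 4 with boundary Z = {0, 4}.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
f = {0: 0.0, 4: 1.0}
h = harmonic_interpolation(neighbors, f)
```

On this path graph the result is the gambler's-ruin probability $h(v) = v/4$ of reaching vertex 4 before vertex 0, which is linear and hence equal to the average of its two neighbors at every interior vertex.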


18 November 13, 2019 


Last time, we defined martingales: given a filtration on a probability space, which is a nested sequence of $\sigma$-fields
$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}$, a martingale is a sequence of integrable random variables $(X_n)_{n \ge 0}$ where $X_n \in \mathcal{F}_n$ (the sequence
is adapted to the filtration), and the martingale property $\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n$ is satisfied. (If we replace this equality
with an inequality $\ge$ or $\le$, we get a submartingale or supermartingale, respectively.)


Remark 164. Last time, we showed that $\mathbb{E}[X_0] = \mathbb{E}[X_1] = \cdots$ for a martingale. By extension, if $X_n$ is a submartingale,
then $\mathbb{E}[X_0] \le \mathbb{E}[X_1] \le \cdots$, and we have the reverse inequalities for a supermartingale.

We'll prove some martingale convergence theorems today, but we'll start with some preliminary facts:


Lemma 165
Let $X_n$ be a martingale, and let $\phi$ be a convex function such that $\phi(X_n) \in L^1$ for all $n$. Then $\phi(X_n)$ is a
submartingale.


As we'll see in the proof, we can replace “convex” with “concave” and we'll get a supermartingale instead. 


Proof. Convex functions are measurable and $X_n \in \mathcal{F}_n$, so $\phi(X_n)$ will be measurable with respect to $\mathcal{F}_n$. Thus, it
suffices to check the submartingale condition, and indeed by conditional Jensen's inequality we have

$$\mathbb{E}[\phi(X_{n+1}) \mid \mathcal{F}_n] \ge \phi(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n]) = \phi(X_n),$$

as desired.


We can also generalize this to submartingales: 


Lemma 166

Let $X_n$ be a submartingale, and let $\phi$ be a nondecreasing convex function such that $\phi(X_n) \in L^1$ for all $n$. Then
$\phi(X_n)$ is a submartingale.


(Again, it is valid to replace “submartingale” and “convex” with “supermartingale” and “concave,” respectively. ) 


Proof. This is almost identical to the previous proof: again $\phi(X_n)$ is measurable with respect to $\mathcal{F}_n$, and we need to
check the submartingale condition again. This time, we have

$$\mathbb{E}[\phi(X_{n+1}) \mid \mathcal{F}_n] \ge \phi(\mathbb{E}[X_{n+1} \mid \mathcal{F}_n]) \ge \phi(X_n),$$

where we've used conditional Jensen's in the first inequality and both the submartingale condition and $\phi$ being
nondecreasing in the second inequality.


Definition 167

A sequence of random variables $(H_n)_{n \ge 1}$ is predictable (also previsible) with respect to a filtration $(\mathcal{F}_n)$ if
$H_n \in \mathcal{F}_{n-1}$ for all $n$. If $X = (X_n)$ is another sequence of random variables, then the $H$-transform of $X$ is defined
via

$$(H \cdot X)_n = \sum_{k=1}^{n} H_k (X_k - X_{k-1}).$$


To understand this definition, suppose we play a betting game, where at each positive integer time $k$ betting 1
dollar means that we make net winnings of $\Delta X_k = X_k - X_{k-1}$ dollars (meaning that we win $1 + \Delta X_k$ dollars from the
game). Betting exactly one dollar at each time gives us net winnings of $\sum_{k=1}^{n} \Delta X_k = X_n - X_0$ (by a telescoping sum),
but more generally we can choose to bet different amounts of money each time. If we instead bet $H_k$ dollars at time
$k$, we do indeed make net winnings of $(H \cdot X)_n$ at time $n$.

The reason that we require $H_n \in \mathcal{F}_{n-1}$ is that the amount we bet at time $n$ should not depend on what
we win at time $n$, only on what has happened up to time $(n - 1)$ (this also motivates the name “previsible”). And if we're
in a situation where $\mathbb{E}[\Delta X_k \mid \mathcal{F}_{k-1}] \le 0$ for all $k$ (because casinos tend to make the player lose in expectation), then
$(H \cdot X)_n$ will be a supermartingale whenever the bets $H_k$ are nonnegative and bounded.
Example 168

Recall that a stopping time $\tau$ is a nonnegative integer-valued random variable with $\{\tau = n\} \in \mathcal{F}_n$ for all $n$. If we
define $H_n = 1\{\tau \ge n\}$ (in other words, betting only if $\tau$ didn't happen yet), notice that $H_n = 1 - \sum_{\ell \le n-1} 1\{\tau = \ell\} \in \mathcal{F}_{n-1}$, so $H$ is previsible. We can then check that $(H \cdot X)_n = X_{n \wedge \tau} - X_0$.
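The $H$-transform is just a running sum of stake times increment, so it is easy to compute along a single sample path. The sketch below is illustrative only: the path `X`, the choice of $\tau$ as the first hitting time of $-1$, and the name `h_transform` are assumptions made for the example. It checks the two identities mentioned above on that one path: unit bets give $X_n - X_0$, and the stakes $H_n = 1\{\tau \ge n\}$ give $X_{n \wedge \tau} - X_0$.

```python
def h_transform(H, X):
    """Return [(H.X)_0, ..., (H.X)_n], where (H.X)_n = sum_{k=1}^n H[k] * (X[k] - X[k-1]).

    H[0] is never used; H[k] is the stake chosen before step k is revealed.
    """
    out = [0]
    for k in range(1, len(X)):
        out.append(out[-1] + H[k] * (X[k] - X[k - 1]))
    return out

# One hypothetical sample path of a walk started at 0.
X = [0, 1, 0, -1, 0, 1, 2, 1]

# Unit stakes at every step: the winnings telescope to X_n - X_0.
H_unit = [0] + [1] * (len(X) - 1)
unit_winnings = h_transform(H_unit, X)

# Stakes H_k = 1{tau >= k}, with tau the first time this path hits -1.
tau = next(n for n, x in enumerate(X) if x == -1)
H_stop = [0] + [1 if tau >= k else 0 for k in range(1, len(X))]
stopped_winnings = h_transform(H_stop, X)
stopped_path = [X[min(n, tau)] - X[0] for n in range(len(X))]
```

Note that `H_stop[k]` only looks at whether $\tau \ge k$, which is decided by the path up to time $k - 1$; this is the previsibility requirement in code form.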


Lemma 169
Let $X_n$ be a submartingale, and let $H_n$ be nonnegative and previsible. Suppose that $H_n \in L^\infty$ for all $n$, meaning that
each $H_n$ is bounded by a (potentially different) constant almost surely. Then $(H \cdot X)_n$ is a submartingale
as well.


Again, the analogous statement holds replacing “submartingales” with “supermartingales.” 


Proof. Let $Y_n = (H \cdot X)_n = \sum_{k=1}^{n} H_k (X_k - X_{k-1})$; we want to show that $Y_n$ is a submartingale. First of all, $Y_n$ is
integrable, because it's the sum of finitely many terms which are each integrable (because each $H_k$ is bounded by a
constant, and each $X_k$ is integrable because $X$ is a submartingale). Also, $Y_n \in \mathcal{F}_n$ because all of the individual terms
are in $\mathcal{F}_n$. So it suffices to show the submartingale condition, which can be written as

$$\mathbb{E}[Y_{n+1} \mid \mathcal{F}_n] \overset{?}{\ge} Y_n \iff \mathbb{E}[Y_{n+1} - Y_n \mid \mathcal{F}_n] \overset{?}{\ge} 0,$$

since $\mathbb{E}[Y_n \mid \mathcal{F}_n] = Y_n$ because $Y_n$ is $\mathcal{F}_n$-measurable. And this indeed holds: notice that

$$\mathbb{E}[Y_{n+1} - Y_n \mid \mathcal{F}_n] = \mathbb{E}[H_{n+1}(X_{n+1} - X_n) \mid \mathcal{F}_n] = H_{n+1}\, \mathbb{E}[X_{n+1} - X_n \mid \mathcal{F}_n]$$

(where we use that $H_{n+1}$ is measurable with respect to $\mathcal{F}_n$), and this last expression is nonnegative because
$X_n$ is a submartingale (and $H_{n+1}$ is nonnegative), as desired.


Corollary 170
If $X_n$ is a submartingale and $\tau$ is a stopping time, then $Y_n = X_{n \wedge \tau}$ is also a submartingale (because the stopped
process can be expressed as an $H$-transform by Example 168).


One goal of today's class (which we should keep in mind) is to prove that any nonnegative supermartingale $X_n$
converges almost surely to some integrable limit $X_\infty$ with $\mathbb{E}[X_\infty] \le \mathbb{E}[X_0]$. Because $\mathbb{E}[X_0] \ge \mathbb{E}[X_1] \ge \cdots$ for a
supermartingale, but all expectations are bounded from below by 0, this fact shouldn't be too surprising. But the main
technical component of this proof is to show a bound on how often a submartingale can “go up,” which is equivalent
to a bound on how often supermartingales can go down.


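Definition 171

Fix constants $-\infty < a < b < \infty$ and let $X_n$ be an adapted sequence of random variables (such as a martingale or submartingale). Define $\tau_0 = -1$, and for all positive integers $k$,
define

$$\tau_{2k-1} = \inf\{n > \tau_{2k-2} : X_n \le a\} \quad \text{and} \quad \tau_{2k} = \inf\{n > \tau_{2k-1} : X_n \ge b\}.$$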


In other words, the $\tau_j$s are the times where $X_n$ goes below $a$, then above $b$, then below $a$, and so on. (And if $X$
never goes below $a$, we have $\tau_1 = \infty$.) By definition, we have $\tau_0 < \tau_1 < \tau_2 < \cdots$ (as long as these times are finite), and we want to keep track of the
ranges $(\tau_{2k-1}, \tau_{2k}]$, which are known as upcrossings because we're going from $a$ up to $b$. For any $n$, the number of
upcrossings $U_n(a, b)$ of $[a, b]$ completed by time $n$ is then $U_n(a, b) = \sup\{k : \tau_{2k} \le n\}$.
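Since the stopping times $\tau_j$ are defined by a simple alternating rule, the upcrossing count of a finite path can be computed in one pass. The following is a minimal sketch; the function name `count_upcrossings` and the test path are hypothetical and only meant to illustrate Definition 171.

```python
def count_upcrossings(path, a, b):
    """Number of upcrossings of [a, b] completed by the finite path X_0, ..., X_n.

    Mirrors Definition 171: wait until the path is <= a (an odd-indexed tau),
    then wait until it is >= b (an even-indexed tau), and repeat.
    """
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    waiting_for_b = False  # True between tau_{2k-1} and tau_{2k}
    for x in path:
        if not waiting_for_b and x <= a:
            waiting_for_b = True
        elif waiting_for_b and x >= b:
            count += 1
            waiting_for_b = False
    return count

example_path = [1, 0, 1, 2, 3, 0, 2, -1]
example_count = count_upcrossings(example_path, a=0, b=2)
```

Here the path dips to 0, rises to 2, dips to 0 again, and rises to 2 again, so two upcrossings of $[0, 2]$ are completed; the final dip to $-1$ starts a third one that is never finished.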
Theorem 172 (Doob's upcrossing inequality)
Let $X_n$ be a submartingale. Then for any $-\infty < a < b < \infty$,

$$\mathbb{E}[U_n(a, b)] \le \frac{\mathbb{E}[(X_n - a)_+] - \mathbb{E}[(X_0 - a)_+]}{b - a} \le \frac{\mathbb{E}[(X_n - a)_+]}{b - a}.$$


Proof. Let $H_n$ be the indicator variable for time $n$ being inside an upcrossing (in other words, $H_n = 1$ if $n \in (\tau_{2k-1}, \tau_{2k}]$
for some $k$). Equivalently, we may write

$$H_n = \sum_{k \ge 1} 1\{n \in (\tau_{2k-1}, \tau_{2k}]\} = \sum_{k \ge 1} \big(1\{n \le \tau_{2k}\} - 1\{n \le \tau_{2k-1}\}\big)$$

because the different upcrossings are disjoint. (If we're worried about having an infinite sum, notice that we really only
need to sum up to $n$ because there can't be more than $n$ upcrossings by time $n$.) But because each $\tau_j$ is a stopping
time, each indicator $1\{n \le \tau_j\}$ is measurable with respect to $\mathcal{F}_{n-1}$ (again by the same logic as Example 168), so $H_n$
is previsible. (Intuitively, we know at time $(n - 1)$ whether we're in an upcrossing at time $n$, because the upcrossing
interval excludes $\tau_{2k-1}$ and includes $\tau_{2k}$.) Since $H_n$ is bounded between 0 and 1 almost surely, we can also define $K_n = 1 - H_n$, and then
Lemma 169 tells us that $(H \cdot X)_n$ and $(K \cdot X)_n$ are both submartingales.

The idea now is that betting during each upcrossing corresponds to an increase in $X$ of at least $(b - a)$, but there is a
minor problem: we may start an upcrossing and then have $X$ become arbitrarily negative, which is bad for our bound.
So we will define

$$Y_n = \max(a, X_n) = a + (X_n - a)_+,$$

which is also a submartingale by Lemma 166 because $x \mapsto \max(a, x)$ is a nondecreasing convex function. Lemma 169 still applies, so
$(H \cdot Y)_n$ and $(K \cdot Y)_n$ are both submartingales as well. Now notice that

$$(H \cdot Y)_n \ge (b - a)\, U_n(a, b),$$


because we can interpret this $H$-transform as “only betting during the upcrossings of $Y$,” in which we win at least the increment
$(b - a)$ from each upcrossing (because at the start of each upcrossing we have $X \le a$, so $Y = a$, and at the end of
each upcrossing we have $X \ge b$, so $Y \ge b$) and may even win an extra amount at the end. (Here is where using $Y$
instead of $X$ is important: we wouldn't necessarily know that the extra amount won is nonnegative with $X$.) Taking
expectations on both sides, we thus have

$$\mathbb{E}[U_n(a, b)] \le \frac{\mathbb{E}[(H \cdot Y)_n]}{b - a},$$

and we just want to upper bound this last quantity in terms of $X$. By definition, we know that

$$Y_n - Y_0 = (1 \cdot Y)_n = (H \cdot Y)_n + (K \cdot Y)_n,$$

and rearranging and taking expectations tells us that

$$\mathbb{E}[(H \cdot Y)_n] = \mathbb{E}[Y_n - Y_0 - (K \cdot Y)_n].$$

But $Y_n - Y_0 = (X_n - a)_+ - (X_0 - a)_+$ by definition, and $(K \cdot Y)_n$ is a submartingale started at zero, so $\mathbb{E}[(K \cdot Y)_n] \ge 0$.
Plugging both of these into the bound $\mathbb{E}[U_n(a, b)] \le \mathbb{E}[(H \cdot Y)_n] / (b - a)$ gives us the desired result.


Intuitively, the key inequality here is that $\mathbb{E}[(K \cdot Y)_n] \ge 0$. In words, it's hard to have many upcrossings because
it's hard for the submartingale $(K \cdot Y)_n$ (the amount of money we make betting outside of upcrossings) to be very
negative.
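As a sanity check on Theorem 172 and on the pathwise inequality $(H \cdot Y)_n \ge (b - a) U_n(a, b)$ from its proof, we can enumerate every path of a short simple random walk and compute both sides exactly. This is only an illustrative sketch: the choices $n = 10$, $a = -1$, $b = 1$ and the helper name `upcrossings_and_winnings` are assumptions made for the example, and the simple random walk is used because it is a martingale (hence a submartingale).

```python
from fractions import Fraction
from itertools import accumulate, product

def upcrossings_and_winnings(path, a, b):
    """Return (U_n, (H.Y)_n) for Y = max(a, X), betting only inside upcrossings."""
    y = [max(a, x) for x in path]
    count, winnings = 0, 0
    betting = False  # value of H_k when step k is played
    for k in range(len(path)):
        if k > 0 and betting:
            winnings += y[k] - y[k - 1]  # H_k * (Y_k - Y_{k-1}) with H_k = 1
        # Choose H_{k+1} using only the path up to time k (previsibility).
        if not betting and path[k] <= a:
            betting = True
        elif betting and path[k] >= b:
            count += 1
            betting = False
    return count, winnings

n, a, b = 10, -1, 1
# All 2^n equally likely paths of a simple random walk started at X_0 = 0.
paths = [[0] + list(accumulate(steps)) for steps in product((-1, 1), repeat=n)]
results = [upcrossings_and_winnings(p, a, b) for p in paths]

expected_upcrossings = Fraction(sum(u for u, _ in results), len(paths))
doob_bound = Fraction(
    sum(max(p[-1] - a, 0) - max(p[0] - a, 0) for p in paths), len(paths)
) / (b - a)
pathwise_ok = all(w >= (b - a) * u for u, w in results)
```

Comparing `expected_upcrossings` with `doob_bound` checks the first inequality of the theorem in this one special case; it is of course not a substitute for the proof.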


Theorem 173 (Submartingale convergence theorem)

If $X_n$ is a submartingale with $\sup_n \mathbb{E}[(X_n)_+] < \infty$, then $X_n$ converges almost surely to some $X_\infty \in L^1$.


Proof. For any real numbers $a, x \in \mathbb{R}$, we have $(x - a)_+ \le x_+ + |a|$, so Theorem 172 also tells us that

$$\mathbb{E}[U_n(a, b)] \le \frac{\mathbb{E}[(X_n)_+] + |a|}{b - a}.$$

By assumption, $\mathbb{E}[(X_n)_+]$ is uniformly bounded in $n$, so the right-hand side is bounded by some finite value and thus the
expected number of upcrossings across $[a, b]$ is uniformly bounded. Since $U_n(a, b)$ is nondecreasing in $n$, the monotone convergence theorem shows that the limit
$\lim_{n \to \infty} U_n(a, b)$ is some random variable $U_\infty(a, b)$ with finite expectation, and any random variable with finite
expectation is finite almost surely. This means that $U_\infty(a, b)$ will be finite for all rational $-\infty < a < b < \infty$ almost
surely (by countable subadditivity). So our sequence $X_1, X_2, \cdots$ does not cross any rational interval
infinitely many times (which rules out $\liminf X_n < a < b < \limsup X_n$), so $X_\infty = \lim_{n \to \infty} X_n$ indeed exists almost surely, possibly as an infinite value.

It remains to show that $X_\infty$ is finite almost surely and that it is integrable. By Fatou's lemma (because $(X_n)_+$
converges to $(X_\infty)_+$), we have

$$\mathbb{E}[(X_\infty)_+] \le \liminf_{n \to \infty} \mathbb{E}[(X_n)_+] < \infty$$

(by assumption), so $(X_\infty)_+$ does have finite expectation. Fatou's lemma also tells us that $\mathbb{E}[(X_\infty)_-] \le \liminf_{n \to \infty} \mathbb{E}[(X_n)_-]$,
but we haven't stated explicitly in our assumptions that this quantity is finite. However, we can write

$$\liminf_{n \to \infty} \mathbb{E}[(X_n)_-] = \liminf_{n \to \infty} \mathbb{E}[(X_n)_+ - X_n].$$

Now $\mathbb{E}[(X_n)_+]$ is bounded, and because $X_n$ is a submartingale we know that $\mathbb{E}[-X_n] \le \mathbb{E}[-X_0]$. Thus we do indeed
have

$$\mathbb{E}[(X_\infty)_-] \le \liminf_{n \to \infty} \mathbb{E}[(X_n)_-] \le \liminf_{n \to \infty} \mathbb{E}[(X_n)_+] - \mathbb{E}[X_0] < \infty.$$

So $X_\infty$ is in $L^1$ because both its positive and negative parts have finite expectation, as desired (which also implies that
it is finite almost surely).


Theorem 174

If $X_n$ is a nonnegative supermartingale, then it converges almost surely to some $X_\infty$ with $\mathbb{E}[X_\infty] \le \mathbb{E}[X_0]$.


Proof. Since $Y_n = -X_n$ is a submartingale, and $(Y_n)_+ = 0$ almost surely for all $n$ (in particular, $\mathbb{E}[(Y_n)_+]$ is uniformly
bounded), the submartingale convergence theorem tells us that we have almost-sure convergence $Y_n \to Y_\infty$, so $X_n$
converges almost surely to $X_\infty = -Y_\infty$. Since the $X_n$s are all nonnegative, so is $X_\infty$, and Fatou's lemma then tells us that

$$\mathbb{E}[X_\infty] \le \liminf_{n \to \infty} \mathbb{E}[X_n] \le \mathbb{E}[X_0],$$

where the last inequality comes from $X_n$ being a supermartingale (so $\mathbb{E}[X_0] \ge \mathbb{E}[X_1] \ge \cdots$).


However, we should be careful: it is not true in general that we will have $\mathbb{E}[X_n] \to \mathbb{E}[X_\infty]$:

Example 175
Let $X_n$ be a simple random walk on $\mathbb{Z}$, and let $X_0 = 1$ and $\tau = \inf\{n \ge 0 : X_n = 0\}$. Then $Y_n = X_{n \wedge \tau}$ is a
nonnegative martingale (which is also a supermartingale), so it converges almost surely. Specifically, it converges
to $Y_\infty = X_\tau = 0$, so we have $\mathbb{E}[Y_\infty] = 0$ but $\mathbb{E}[Y_n] = 1$ for any $n$.
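The gap between $\mathbb{E}[Y_n] = 1$ and $\mathbb{E}[Y_\infty] = 0$ in Example 175 can be seen exactly by propagating the distribution of the stopped walk one step at a time. The sketch below uses exact rational arithmetic; the function name `stopped_walk_distribution` and the chosen horizons are illustrative assumptions.

```python
from fractions import Fraction

def stopped_walk_distribution(n_steps, start=1, absorb=0):
    """Exact law of Y_n = X_{n ^ tau}, where tau is the first hitting time of `absorb`."""
    dist = {start: Fraction(1)}
    for _ in range(n_steps):
        new = {}
        for x, p in dist.items():
            if x == absorb:
                # Already stopped: the walk stays frozen at the absorbing state.
                new[x] = new.get(x, 0) + p
            else:
                # Otherwise move up or down with probability 1/2 each.
                for y in (x - 1, x + 1):
                    new[y] = new.get(y, 0) + p / 2
        dist = new
    return dist

horizons = (1, 10, 50, 200)
laws = {n: stopped_walk_distribution(n) for n in horizons}
means = {n: sum(x * p for x, p in laws[n].items()) for n in horizons}
absorbed = {n: laws[n].get(0, Fraction(0)) for n in horizons}
```

The mean stays equal to 1 at every horizon, while the mass at 0 keeps growing: the expectation is carried by a shrinking probability of being very far from 0, which is exactly the kind of behavior that uniform integrability (discussed later) rules out.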


To conclude the lecture, we'll see an application of the convergence theorem: 


Example 176 (Branching process)
Let $(X_{n,i})$ be an array of iid nonnegative integer random variables that have the same distribution as $X$. The law
of $X$ is some probability measure $p$ supported on the nonnegative integers called the offspring law. In a Galton-Watson
tree, we start with a root vertex which gets a random number of children (possibly zero) $X_{1,1}$, which
form the first level of the tree. From there, the next level is formed by giving the first child $X_{2,1}$ children, the
second child $X_{2,2}$ children, and so on.


This tree may be finite or infinite, and the Galton-Watson process $Z_n$ is a summary of this tree, where $Z_n$ denotes
the number of vertices at level $n$. If we now define

$$\mathcal{F}_n = \sigma(X_{\ell,i} : i \ge 1,\ 1 \le \ell \le n),$$

then the process $Z_n$ is adapted to the filtration $\mathcal{F}_n$ (because we only need the $X_{\ell,i}$s up to level $n$ to determine the
number of children at level $n$). And in fact, $\mathcal{F}_n$ encodes more than just $\sigma(Z_0, Z_1, \cdots, Z_n)$, because the $Z_j$s only tell
us the total population in each generation and not which parents they come from. One way to write this process more
mathematically is that

$$Z_n = \sum_{i=1}^{Z_{n-1}} X_{n,i}.$$

In particular, if $Z_{n-1} = 0$ (there is no population at level $(n - 1)$), then $Z_n = 0$ (all future populations are dead). More
generally,

$$\mathbb{E}[Z_{n+1} \mid \mathcal{F}_n] = Z_n\, \mathbb{E}[X],$$

because each of the $Z_n$ vertices at level $n$ generates $\mathbb{E}[X]$ children on average. So this means that $Z_n / (\mathbb{E}[X])^n$ is a nonnegative
martingale (when $0 < \mathbb{E}[X] < \infty$), and next time, we'll use this fact to study what happens to the random tree as $n$ gets large.
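The identity $\mathbb{E}[Z_{n+1} \mid \mathcal{F}_n] = Z_n \mathbb{E}[X]$ implies $\mathbb{E}[Z_n] = (\mathbb{E}[X])^n$ when $Z_0 = 1$, and for a finitely supported offspring law this can be checked exactly by computing the law of each generation. The sketch below does so with rational arithmetic; the offspring law $(\frac14, \frac14, \frac12)$ on $\{0, 1, 2\}$ and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction

def poly_mul(f, g):
    """Convolution of two probability mass functions given as coefficient lists."""
    out = [Fraction(0)] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

def next_generation(law, offspring):
    """Law of Z_n given the law of Z_{n-1}: a mixture of z-fold convolutions."""
    size = (len(law) - 1) * (len(offspring) - 1) + 1
    out = [Fraction(0)] * size
    power = [Fraction(1)]  # law of a sum of zero offspring counts (identically 0)
    for z, pz in enumerate(law):
        for k, q in enumerate(power):
            out[k] += pz * q
        power = poly_mul(power, offspring)  # law of a sum of z + 1 offspring counts
    return out

# Hypothetical offspring law: P(X=0) = 1/4, P(X=1) = 1/4, P(X=2) = 1/2.
offspring = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
mu = sum(k * p for k, p in enumerate(offspring))

law = [Fraction(0), Fraction(1)]  # Z_0 = 1: a single root vertex
means = []
for n in range(1, 6):
    law = next_generation(law, offspring)
    means.append(sum(k * p for k, p in enumerate(law)))
```

Dividing `means[n - 1]` by `mu ** n` gives exactly 1 for each generation, which is the statement that the normalized process has constant expectation.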


19 November 18, 2019 


As a reminder, the next homework assignment is due on Wednesday (there are two extra problems from the original
assignment). Last time, we proved the submartingale convergence theorem: if $\mathbb{E}[(X_n)_+]$ is uniformly bounded, then
a submartingale $X_n$ converges almost surely to some finite random variable $X_\infty$. In particular, this implies that any nonnegative
supermartingale $X_n$ converges almost surely to some integrable limit $X_\infty$. However, neither result implies that $\mathbb{E}[X_n] \to \mathbb{E}[X_\infty]$,
and it is useful to know when that statement is true. For example, if $X_n = M_{n \wedge \tau}$ is a stopped process,
then we want to find conditions under which $\mathbb{E}[M_{n \wedge \tau}] \to \mathbb{E}[M_\tau]$ (in particular, this would imply that $\mathbb{E}[M_\tau] = \mathbb{E}[M_0]$).
So the main result we'll be proving today is the $L^p$ martingale convergence theorem, which states that if $X_n$ is a
martingale with $\sup_n \|X_n\|_p < \infty$ for some $1 < p < \infty$, then $X_n \to X_\infty$ almost surely and in $L^p$. It will turn out that
this also implies that $X_n \to X_\infty$ in $L^1$, so $\mathbb{E}[X_n] \to \mathbb{E}[X_\infty]$ (which is what we actually care about). And notably, this
result would not be true if we used $p = 1$ instead.
Remark 177. Last lecture, we mentioned that many results that hold for submartingales also hold for supermartingales
(with reversed inequalities and assumptions, and thus with equality for martingales). This will hold for many of the
results we prove in the next few lectures, as we can see in the proofs.


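Lemma 178 (Doob's maximal inequality)
Let $X_n$ be a submartingale, and define the nondecreasing process $\bar{X}_n = \max\{(X_i)_+ : 0 \le i \le n\}$. Then for any
$\lambda > 0$,

$$\lambda\, \mathbb{P}\Big(\max_{0 \le i \le n} X_i \ge \lambda\Big) = \lambda\, \mathbb{P}(\bar{X}_n \ge \lambda) \le \mathbb{E}[X_n; \bar{X}_n \ge \lambda] \le \mathbb{E}[(X_n)_+; \bar{X}_n \ge \lambda] \le \mathbb{E}[(X_n)_+].$$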


Proof. The first equality is the definition of $\bar{X}_n$ (using $\lambda > 0$), the second inequality holds because $X_n \le (X_n)_+$, and the last inequality
is true because $(X_n)_+$ is always nonnegative (so only counting the expectation on $\{\bar{X}_n \ge \lambda\}$ can only decrease the
expectation). Thus, we only need to check the first inequality (which would be Markov's inequality if $X_n$ were replaced
with $\bar{X}_n$).

First of all, because $X_n$ is a submartingale, if $\tau$ is a stopping time with $0 \le \tau \le k$ almost surely, then $\mathbb{E}[X_0] \le
\mathbb{E}[X_\tau] \le \mathbb{E}[X_k]$ (exercise). So defining $\sigma = \inf\{i \ge 0 : X_i \ge \lambda\}$ and $\tau = \sigma \wedge n$ (so that $0 \le \tau \le n$), we have
$\mathbb{E}[X_\tau] \le \mathbb{E}[X_n]$. But we can only have $X_\tau \ne X_n$ if $\sigma < n$ (meaning that we've already hit at least $\lambda$ before $n$), which
means that $X_\tau \ge \lambda$ whenever $X_\tau \ne X_n$. This means that we also have

$$\mathbb{E}[X_n; \bar{X}_n \ge \lambda] \ge \mathbb{E}[X_\tau; \bar{X}_n \ge \lambda] \ge \lambda\, \mathbb{P}(\bar{X}_n \ge \lambda)$$

(first step because $\mathbb{E}[X_\tau] \le \mathbb{E}[X_n]$ and outside of the event $\{\bar{X}_n \ge \lambda\}$ the two variables are equal, and second step
because $X_\tau \ge \lambda$ whenever $\bar{X}_n \ge \lambda$). This shows the desired result.


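Corollary 179 (Kolmogorov maximal inequality)

Let $M_n$ be a martingale with $M_0 = 0$, and let $X_n = M_n^2$ (which is a submartingale because $f(x) = x^2$ is convex).
Then applying Lemma 178 to $X_n$ (with $\lambda = x^2$), we have

$$\mathbb{P}\Big(\max_{0 \le i \le n} |M_i| \ge x\Big) = \mathbb{P}\Big(\max_{0 \le i \le n} X_i \ge x^2\Big) \le \frac{\mathbb{E}[(X_n)_+]}{x^2} = \frac{\operatorname{Var}(M_n)}{x^2},$$

where we've used that $\mathbb{E}[M_n] = \mathbb{E}[M_0] = 0$.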


This result may look similar to Chebyshev's inequality, which tells us directly that

$$\mathbb{P}(|M_n| \ge x) \le \frac{\operatorname{Var}(M_n)}{x^2},$$

but this version is stronger because we're able to bound multiple $M_i$s at once.
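To see how the maximal inequality compares with Chebyshev's inequality in a concrete case, we can enumerate all paths of a short simple random walk, which is a martingale started at 0 with $\operatorname{Var}(M_n) = n$. The values $n = 12$ and $x = 6$ below are arbitrary illustrative choices, and the computation is exact.

```python
from fractions import Fraction
from itertools import accumulate, product

n, x = 12, 6
# All 2^n equally likely paths (M_1, ..., M_n) of a simple random walk with M_0 = 0.
paths = [list(accumulate(steps)) for steps in product((-1, 1), repeat=n)]
total = len(paths)

# P(max_i |M_i| >= x): the quantity controlled by the Kolmogorov maximal inequality.
p_max = Fraction(sum(max(abs(m) for m in path) >= x for path in paths), total)
# P(|M_n| >= x): the quantity controlled by Chebyshev's inequality.
p_end = Fraction(sum(abs(path[-1]) >= x for path in paths), total)

# Var(M_n) = E[M_n^2] because E[M_n] = 0.
variance = Fraction(sum(path[-1] ** 2 for path in paths), total)
bound = variance / x ** 2
```

Both probabilities sit below the same bound $\operatorname{Var}(M_n)/x^2$, but `p_max` is the larger of the two, so the maximal inequality is the stronger statement.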


Theorem 180 ($L^p$ maximal inequality)
Let $X_n$ be a submartingale, and define $\bar{X}_n = \max\{(X_i)_+ : 0 \le i \le n\}$. Then for all $p \in (1, \infty)$, we have

$$\|\bar{X}_n\|_p \le \frac{p}{p - 1}\, \|(X_n)_+\|_p.$$

In other words, the maximum of the first $n$ values of the submartingale can be controlled by just the final one.
(However, note that this is really only a statement about the positive part of $X_n$: it doesn't prohibit $X_n$ from being
very negative.)
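Proof. One annoying detail is that we don't know in advance that $\bar{X}_n$ is in $L^p$, so we will truncate and work with
$\bar{X}_n \wedge M$ for some finite constant $M$. We claim that

$$\lambda\, \mathbb{P}(\bar{X}_n \wedge M \ge \lambda) \le \mathbb{E}[(X_n)_+; \bar{X}_n \wedge M \ge \lambda].$$

Indeed, this is trivially true when $M < \lambda$ (because both sides are zero), and otherwise this reduces to Lemma 178.
Thus we have (with our usual integration trick)

$$\mathbb{E}[(\bar{X}_n \wedge M)^p] = \int_0^\infty p y^{p-1}\, \mathbb{P}(\bar{X}_n \wedge M \ge y)\, dy \le \int_0^\infty p y^{p-2}\, \mathbb{E}[(X_n)_+; \bar{X}_n \wedge M \ge y]\, dy.$$

By Tonelli's theorem (everything in the integrand is nonnegative), we may swap the order of integration, and the
right-hand side becomes

$$\mathbb{E}\Big[(X_n)_+ \int_0^\infty p y^{p-2}\, 1\{\bar{X}_n \wedge M \ge y\}\, dy\Big] = \mathbb{E}\Big[(X_n)_+ \cdot \frac{p}{p - 1} (\bar{X}_n \wedge M)^{p-1}\Big].$$

So by Hölder's inequality applied to $(X_n)_+$ and $(\bar{X}_n \wedge M)^{p-1}$ (where $\frac{1}{p} + \frac{1}{p'} = 1$, so $p' = \frac{p}{p-1}$), we have

$$\mathbb{E}[(\bar{X}_n \wedge M)^p] \le \frac{p}{p - 1}\, \|(X_n)_+\|_p\, \big\|(\bar{X}_n \wedge M)^{p-1}\big\|_{p'} = \frac{p}{p - 1}\, \|(X_n)_+\|_p\, \|\bar{X}_n \wedge M\|_p^{p-1}.$$

Since the left-hand side is $\|\bar{X}_n \wedge M\|_p^p$ (which is finite thanks to the truncation), rearranging this inequality yields

$$\|\bar{X}_n \wedge M\|_p \le \frac{p}{p - 1}\, \|(X_n)_+\|_p.$$

Now if the right-hand side is infinite, the theorem is automatically true, and otherwise we can take $M \to \infty$ and use
the monotone convergence theorem to finish.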


Theorem 181 ($L^p$ martingale convergence theorem)

Let $X_n$ be a martingale. If there is some $p \in (1, \infty)$ such that $\sup_n \|X_n\|_p < \infty$, then $X_n \to X_\infty$ almost surely
and in $L^p$.


Proof. We have $\sup_n \mathbb{E}[(X_n)_+] \le \sup_n \mathbb{E}[|X_n|] < \infty$, because $\|X_n\|_p$ is uniformly bounded and $\|X_n\|_1 \le \|X_n\|_p$ (by
Corollary 75). So $X_n$ satisfies the conditions of the submartingale convergence theorem (Theorem 173), and thus
$X_n \to X_\infty$ almost surely.

To show convergence in $L^p$, we apply the $L^p$ maximal inequality, which tells us that

$$\Big\|\max_{0 \le i \le n} (X_i)_+\Big\|_p \le \frac{p}{p - 1}\, \|(X_n)_+\|_p,$$

and also that

$$\Big\|\max_{0 \le i \le n} (X_i)_-\Big\|_p \le \frac{p}{p - 1}\, \|(X_n)_-\|_p$$

(because both $X_n$ and $-X_n$ are submartingales if $X_n$ is a martingale). By assumption, both right-hand sides are
uniformly bounded in $n$, and the left-hand sides are nondecreasing in $n$, so $Y = \sup_{i \ge 0} |X_i|$ is a random
variable in $L^p$ (by the monotone convergence theorem applied to the variables $\max_{0 \le i \le n} |X_i|^p$).

So now $X_n \to X_\infty$ almost surely, meaning $|X_n - X_\infty|^p \to 0$ almost surely. But $|X_n - X_\infty|^p$ is dominated by $(2Y)^p$,
which is integrable, so the dominated convergence theorem tells us that $\mathbb{E}[|X_n - X_\infty|^p] \to 0$, which is the desired $L^p$
convergence.
We're now ready to apply these results to branching processes. Recall the definition of a Galton-Watson tree:
we have a doubly infinite array of random variables $(X_{n,i})_{n \ge 1, i \ge 1}$, all iid with an offspring law $X$ which is supported
on $\mathbb{Z}_{\ge 0}$ and has finite mean $\mu = \mathbb{E}[X]$. We start with a single root vertex, which generates $X_{1,1}$ children at depth
1. Then the $i$th vertex at depth $(n - 1)$ generates $X_{n,i}$ children at depth $n$, the Galton-Watson process $Z_n$ is
just the number of vertices at depth $n$, and we have the equation $Z_n = \sum_{i=1}^{Z_{n-1}} X_{n,i}$. Last time, we showed that if
$\mathcal{F}_n = \sigma(X_{\ell,i} \text{ for all } i \ge 1,\ 1 \le \ell \le n)$ (that is, the filtration tells us the structure of the tree up to depth $n$), then

$$\mathbb{E}[Z_{n+1} \mid \mathcal{F}_n] = \mu Z_n \implies M_n = \frac{Z_n}{\mu^n} \text{ is a martingale.}$$

Furthermore, since $M_n$ is a nonnegative martingale, we know that $M_n$ converges almost surely to some limit $M_\infty$ with
$\mathbb{E}[M_\infty] \le \mathbb{E}[M_0] = 1$. But from here, the behavior depends on the mean $\mu$:


* One case that is easy to understand is the subcritical Galton-Watson process, where $\mu < 1$. By Markov's inequality
(using that $Z_n$ is nonnegative integer-valued),

$$\mathbb{P}(Z_n > 0) = \mathbb{P}(Z_n \ge 1) \le \mathbb{E}[Z_n] = \mu^n,$$

which is exponentially small in $n$. Thus $Z_n$ must converge to 0 almost surely, so the population goes extinct with
probability 1. In fact, we even have that $M_n = \frac{Z_n}{\mu^n} \to 0$ almost surely in this case, because $Z_n$ is integer-valued
so it must eventually be zero. So the limit $M_\infty$ of the martingale is identically zero, and in fact this is an example
where $\mathbb{E}[M_\infty] = 0 < 1 = \mathbb{E}[M_0]$.


* The other cases are more interesting. $\mu = 1$ gives us the critical Galton-Watson process, and for this case we
will assume that $\mathbb{P}(X = 1) < 1$ (because otherwise we just have one child at each level and nothing interesting
happens). It turns out that extinction will occur here again, but the proof is less straightforward:


Proposition 182

In the critical Galton-Watson process, we have $Z_n \to 0$ almost surely.


Proof. Because $\mu = 1$, $Z_n$ is a nonnegative martingale (and thus also a nonnegative supermartingale), so it converges
almost surely to some limit $Z_\infty$. Since $Z_n$ is integer-valued, $Z_\infty$ will also be integer-valued, and in fact (to have
almost-sure convergence) we must have $Z_n(\omega) = Z_\infty(\omega)$ for all sufficiently large $n \ge n(\omega)$. In other words, we have
that

$$\mathbb{P}\Big(\bigcup_{k \ge 0} \bigcup_{m \ge 0} \{Z_n = k \text{ for all } n \ge m\}\Big) = 1$$

(where $k$ corresponds to $Z_\infty(\omega)$ and $m$ corresponds to $n(\omega)$). It suffices to show that this event cannot occur with
positive probability for any $k > 0$. Intuitively, this is because we're asking the process to stay at $k$ children forever, but
$X$ is a nondegenerate random variable so this cannot always happen. To make that rigorous, note that for any $k > 0$,
there is some constant $c_k < 1$ (independent of $n$) such that $\mathbb{P}(Z_{n+1} = k \mid Z_n = k) = c_k$. Then

$$\mathbb{P}(Z_n = k \text{ for all } n \in \{m, \cdots, m + \ell\}) \le c_k^{\ell},$$

which decreases exponentially with $\ell$, so the probability that $Z_n = k$ for all $n \ge m$ is zero. The countable union of all
such events over all $k > 0$ and all $m$ thus has probability zero, so we indeed have

$$\mathbb{P}\Big(\bigcup_{m \ge 0} \{Z_n = 0 \text{ for all } n \ge m\}\Big) = 1.$$

This is exactly the same as having $Z_n$ converge to 0 almost surely.


* The most interesting case is the supercritical Galton-Watson process, where $\mu > 1$. This time, if the probability $p_0$
of producing zero children is zero, then $Z_n > 0$ almost surely for all $n$ (because every vertex produces a positive
number of children). But if $p_0 > 0$, then the tree does go extinct with some positive probability (for example,
if all vertices at a given level produce zero children). Intuitively, though, once the tree has survived for 100
generations, it will typically be pretty big, so we expect the tree to survive with some sizable probability. So the
interesting question here is whether the probability of nonextinction is positive.


To answer this question, we condition on the first level of the tree. If the first level has (for example) 3 children,
then the $(n + 1)$th level of the tree dies out if and only if the 3 level-$n$ trees rooted at those children all die out.
Writing $p_k = \mathbb{P}(X = k)$, this yields the recursive formula

$$\mathbb{P}(Z_{n+1} = 0) = \sum_{k \ge 0} p_k\, \mathbb{P}(Z_n = 0)^k = \psi(\mathbb{P}(Z_n = 0)),$$

where

$$\psi(s) = \sum_{k \ge 0} p_k s^k = \mathbb{E}[s^X]$$

is a reparameterization of the moment generating function. Because $X$ is nonnegative, $0 \le s^X \le 1$ when
$0 \le s \le 1$. Thus $\mathbb{E}[s^X]$ is finite for $s \in [0, 1]$ (which is the range in which we are applying $\psi$), meaning this
function is indeed defined. We can also check that $\psi$ is nondecreasing and convex in $s$, and $\psi'(s) \uparrow \mu$ as $s \uparrow 1$
(by differentiating term-by-term and using the dominated convergence theorem on the partial sums). Since
$\psi(1) = 1$ and $\psi(0) = p_0 > 0$, the graph of $\psi$ will have the following general shape:

[Figure: graph of $\psi(s)$ for $s \in [0, 1]$, starting at height $p_0$ at $s = 0$ and ending at $\psi(1) = 1$.]


Specifically, because the slope of $\psi$ as $s \to 1$ is $\mu > 1$, the graph of $\psi$ will intersect the line $y = x$ at some unique point
$0 < \rho < 1$ (because $\psi(x) - x$ is convex, is positive at $x = 0$, and vanishes at $x = 1$ with positive derivative there).
We can now write

$$\mathbb{P}(Z_n = 0) = \psi(\mathbb{P}(Z_{n-1} = 0)) = \psi^n(\mathbb{P}(Z_0 = 0)) = \psi^n(0),$$

where $\psi^n$ denotes the $n$-fold composition of $\psi$. Repeatedly applying this nondecreasing convex function starting from 0 will bring us towards a fixed point (which cannot be 1 because $\psi(x) < x$
for $x$ near 1), and thus $\mathbb{P}(Z_n = 0) \to \rho$ as $n \to \infty$. Finally, because the events $\{Z_n = 0\}$ are nested (if we're
extinct at time $n$, we're extinct at time $(n + 1)$), their probabilities are nondecreasing and the events increase to the event
$\{$tree goes extinct$\}$. Thus by continuity from below, the probability that the Galton-Watson tree goes extinct is
indeed $\rho < 1$.
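The recursion $\mathbb{P}(Z_n = 0) = \psi^n(0)$ is also a practical way to compute the extinction probability numerically: we iterate the generating function starting from 0. The sketch below is illustrative; the two offspring laws and the function names `pgf` and `extinction_probability` are assumptions made for the example.

```python
def pgf(p, s):
    """Generating function psi(s) = sum_k p[k] * s**k of a finite offspring law p."""
    return sum(pk * s ** k for k, pk in enumerate(p))

def extinction_probability(p, iterations=500):
    """Approximate lim_n P(Z_n = 0) = lim_n psi^n(0) by iterating psi from 0."""
    q = 0.0
    for _ in range(iterations):
        q = pgf(p, q)
    return q

# Hypothetical supercritical law: mean 1/4 + 2 * 1/2 = 5/4 > 1.
supercritical = [0.25, 0.25, 0.5]
rho = extinction_probability(supercritical)

# Hypothetical subcritical law: mean 1/4 + 2 * 1/4 = 3/4 < 1.
subcritical = [0.5, 0.25, 0.25]
rho_sub = extinction_probability(subcritical)
```

For the supercritical law, solving $\frac14 + \frac14 s + \frac12 s^2 = s$ by hand gives the roots $s = \frac12$ and $s = 1$, so the iteration should settle at $\frac12$; for the subcritical law it climbs to 1, matching the earlier bullet on almost-sure extinction.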


For a more complicated question, suppose we have a supercritical Galton-Watson tree and condition on the event
that we do not go extinct. Then we may ask about the limiting distribution of $M_\infty = \lim_{n \to \infty} \frac{Z_n}{\mu^n}$. Knowing the answer
tells us a lot about the process: if we knew that $M_n$ converged to some finite limit in $(0, \infty)$, then we'd know that
$Z_n$ grows like $\mu^n$ up to a constant factor (which is much stronger than just knowing for example that $Z_n = \mu^{n + o(n)}$).
It turns out that we have an exact characterization of when this happens. The same recursion as before tells us that

$$\psi(\mathbb{P}(M_\infty = 0)) = \mathbb{P}(M_\infty = 0).$$

The only fixed points of $\psi$ in $[0, 1]$ are the extinction probability $\rho$ and 1, so $\mathbb{P}(M_\infty = 0)$ must be one of these two values. (We can't have $\mathbb{P}(M_\infty = 0) = 0$ when $p_0 \ne 0$, because extinction implies $M_\infty = 0$ and
occurs with positive probability $\rho \ne 0$.) And it turns out that the two possibilities $\rho$ and 1
both occur:


Theorem 183 (Kesten-Stigum $L \log L$ criterion)

In a supercritical Galton-Watson tree, $M_\infty = \lim_{n \to \infty} \frac{Z_n}{\mu^n}$ is not identically zero if and only if the offspring law $X$ satisfies
$\mathbb{E}[X \log X] < \infty$.


We'll see a special case of this on our homework, and we'll also discuss this topic more in a later lecture. 


20 November 20, 2019 


Our last homework assignment is already posted on Stellar, and it’s due on Monday, December 2 (the class day before 
the exam). There will be lecture that Monday but not on Wednesday, and we should make sure we have enough time 
to both study for the exam and finish the problem set. 

We've been discussing martingale convergence theorems in the last few lectures. We'll start with a brief review: if
$X_n$ is a submartingale satisfying the moment condition $\sup_n \mathbb{E}[(X_n)_+] < \infty$, then it converges almost surely (by Doob's upcrossing inequality).
As a consequence, any nonnegative supermartingale converges almost surely. However, we have no guarantee that
there is convergence in $L^1$, and indeed we do not always have $\mathbb{E}[X_n] \to \mathbb{E}[X_\infty]$. Last time, we showed a sufficient
condition with the $L^p$ martingale convergence theorem: if there is some $p \in (1, \infty)$ such that $\sup_n \|X_n\|_p < \infty$, then
$X_n \to X_\infty$ almost surely and in $L^p$. And if $X_n \to X_\infty$ in $L^p$, then it also converges in $L^1$, so the expectations also converge.
But it turns out that we can get an exact characterization for convergence in $L^1$, which we'll discuss today.


Definition 184
Let $(X_i)_{i \in I}$ be a family of random variables on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Then $(X_i)$ is uniformly integrable
(also u.i.) if

$$\lim_{M \to \infty} \sup_{i \in I} \mathbb{E}[|X_i|; |X_i| \ge M] = 0.$$

This is related to the idea of tightness of a set of measures (but isn't exactly equivalent). Recall that for any
integrable random variable $X$, the dominated convergence theorem tells us that $\mathbb{E}[|X|; |X| \ge M] \to 0$ as $M \to \infty$, but uniform
integrability is a uniform condition on the whole set of random variables.


Remark 185. Note that if $(X_i)_{i \in I}$ is uniformly integrable, then for any $X_i$ we can write

$$\mathbb{E}[|X_i|] = \mathbb{E}[|X_i|; |X_i| < M] + \mathbb{E}[|X_i|; |X_i| \ge M] \le M + \mathbb{E}[|X_i|; |X_i| \ge M].$$

Then we can make the second term uniformly small (say smaller than 1) by taking $M$ to be sufficiently large. In
particular, this shows that $\sup_{i \in I} \mathbb{E}[|X_i|] < \infty$, so uniform integrability is stronger than having a uniform bound on the
$L^1$ norm. In particular, this means that for any fixed $M$, the supremum in Definition 184 is finite.
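To see that a uniform $L^1$ bound really is weaker than uniform integrability, it helps to compute the tail expectations in Definition 184 for two simple families. The families below are standard illustrations and are not taken from the lecture: $X_n = n \cdot 1\{U \le 1/n\}$ and $Y_n = \sqrt{n} \cdot 1\{U \le 1/n\}$ for $U$ uniform on $[0, 1]$. The helper names are hypothetical, and the supremum over all $n$ is replaced by a maximum over $n \le N$ as a finite proxy.

```python
import math
from fractions import Fraction

def tail_expectation(value, prob, M):
    """E[|X|; |X| >= M] when X equals `value` > 0 with probability `prob` and 0 otherwise."""
    return value * prob if value >= M else 0

def sup_tail(family, M):
    """Largest tail expectation over a finite family of (value, prob) pairs."""
    return max(tail_expectation(v, p, M) for v, p in family)

N = 10_000  # finite truncation of the index set (an assumption of this sketch)
spiky = [(n, Fraction(1, n)) for n in range(1, N + 1)]       # X_n = n on an event of probability 1/n
tame = [(math.sqrt(n), 1.0 / n) for n in range(1, N + 1)]    # Y_n = sqrt(n) on the same event

spiky_tails = {M: sup_tail(spiky, M) for M in (1, 10, 100, 1000)}
tame_tails = {M: sup_tail(tame, M) for M in (1, 10, 100)}
```

Every $X_n$ has $\mathbb{E}[|X_n|] = 1$, yet the tail supremum stays at 1 for every cutoff $M$, so the family is bounded in $L^1$ but not uniformly integrable. For the $Y_n$ the tail supremum is about $1/M$, which does go to 0.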
We'll describe one natural way to construct a family of uniformly integrable random variables:

Lemma 186

For any random variable $X \in L^1(\Omega, \mathcal{F}, \mathbb{P})$, the family $\{\mathbb{E}[X \mid \mathcal{G}] : \mathcal{G} \subseteq \mathcal{F} \text{ sub-}\sigma\text{-field}\}$ is uniformly integrable.


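Proof. First, we claim that for all $\varepsilon > 0$, there is some $\delta > 0$ so that for all $A \in \mathcal{F}$,

$$\mathbb{P}(A) \le \delta \implies \mathbb{E}[|X|; A] \le \varepsilon.$$

(This was on our homework. The idea is that if this didn't hold, then we could find a sequence of events $A_n$ with $\mathbb{P}(A_n) \downarrow 0$
but $\mathbb{E}[|X|; A_n] \not\to 0$. But $X$ is in $L^1$, so this is a violation of the dominated convergence theorem.) Turning to the proof
of the lemma, by Markov's inequality, conditional Jensen's inequality, and the tower property, we have

$$\mathbb{P}(|\mathbb{E}[X \mid \mathcal{G}]| \ge M) \le \frac{\mathbb{E}[|\mathbb{E}[X \mid \mathcal{G}]|]}{M} \le \frac{\mathbb{E}[\mathbb{E}[|X| \mid \mathcal{G}]]}{M} = \frac{\mathbb{E}[|X|]}{M}.$$

We will be using the various events $\{|\mathbb{E}[X \mid \mathcal{G}]| \ge M\}$ (for sub-$\sigma$-fields $\mathcal{G}$) as the potential events $A$, and our goal is to bound

$$\mathbb{E}\big[|\mathbb{E}[X \mid \mathcal{G}]|; |\mathbb{E}[X \mid \mathcal{G}]| \ge M\big] = \mathbb{E}\big[|\mathbb{E}[X \mid \mathcal{G}]| \cdot 1\{|\mathbb{E}[X \mid \mathcal{G}]| \ge M\}\big]$$

by writing it in the form $\mathbb{E}[|X|; A]$. But $1\{|\mathbb{E}[X \mid \mathcal{G}]| \ge M\}$ is $\mathcal{G}$-measurable, so we can put it inside the
conditional expectation as well and use conditional Jensen's and the tower property again, bounding the expression above by

$$\mathbb{E}\big[\big|\mathbb{E}[X \cdot 1\{|\mathbb{E}[X \mid \mathcal{G}]| \ge M\} \mid \mathcal{G}]\big|\big] \le \mathbb{E}\big[\mathbb{E}[|X| \cdot 1\{|\mathbb{E}[X \mid \mathcal{G}]| \ge M\} \mid \mathcal{G}]\big] = \mathbb{E}\big[|X|; |\mathbb{E}[X \mid \mathcal{G}]| \ge M\big].$$

By the claim at the beginning of our proof, we can make this expression at most $\varepsilon$ (for all $\mathcal{G}$ simultaneously) if we
make $\mathbb{P}(|\mathbb{E}[X \mid \mathcal{G}]| \ge M)$ sufficiently small. But we have shown that those probabilities are bounded by $\frac{\mathbb{E}[|X|]}{M}$, so taking
$M$ large enough yields the result.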


Theorem 187

Let $X_n$ be integrable random variables. If $X_n \to X$ in probability, then the following are equivalent:

1. $\{X_n\}_{n \ge 0}$ is uniformly integrable,

2. $X_n \to X$ in $L^1$,

3. $\mathbb{E}[|X_n|] \to \mathbb{E}[|X|] < \infty$.


This result is true for random variables in general, but we'll apply it to martingales. 
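
To see why uniform integrability cannot simply be dropped, here is a small sanity check on a standard counterexample (not taken from the lecture): on $[0,1]$ with a uniform random variable $U$, let $X_n = n \cdot 1\{U < 1/n\}$. Then $X_n \to 0$ in probability, but $E[|X_n|] = 1$ for every $n$, so neither $L^1$ convergence nor uniform integrability holds. The function names below are hypothetical, and the expectations are computed exactly rather than simulated.

```python
from fractions import Fraction

def tail_expectation(n, M):
    """E[|X_n|; |X_n| >= M] for X_n = n * 1{U < 1/n}, with U uniform on [0, 1] and M > 0."""
    # X_n equals n with probability 1/n and 0 otherwise, so the tail
    # expectation is n * (1/n) = 1 when n >= M and 0 when n < M.
    return Fraction(n) * Fraction(1, n) if n >= M else Fraction(0)

def prob_nonzero(n):
    """P(|X_n| > eps) for any 0 < eps < 1, which is P(U < 1/n)."""
    return Fraction(1, n)

for M in (10, 100, 1000):
    # The supremum over n never drops below 1, no matter how large M is.
    worst = max(tail_expectation(n, M) for n in range(1, 5 * M))
    print(f"M = {M}: sup over n < {5 * M} of E[|X_n|; |X_n| >= M] is {worst}")

print("P(|X_n| > eps) at n = 1000:", prob_nonzero(1000))
```

The tail expectations stay at 1 for every threshold $M$, while the probabilities $P(|X_n| > \varepsilon)$ shrink like $1/n$, which is exactly the gap between convergence in probability and condition (1).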


Remark 188. We will use the bounded convergence theorem in our proof, so we will mention now that our measure
theory convergence theorems can be strengthened. Instead of proving that $\int f_n \, d\mu \to \int f \, d\mu$, both results (with slight
modifications) actually prove the stronger result that $\int |f_n - f| \, d\mu \to 0$. Additionally (using that convergence in
probability implies almost-sure convergence along a subsequence), the dominated convergence theorem only requires
convergence in probability.


Proof. First, we show that (1) $\implies$ (2). By Remark 185, uniform integrability implies that $\sup_n E[|X_n|] < \infty$. Again
recalling that convergence in probability implies existence of a subsequence $X_{n_k} \to X$ converging almost surely, Fatou's
lemma shows that $E[|X|] \le \liminf_{k \to \infty} E[|X_{n_k}|] < \infty$, so $X \in L^1$. To show convergence in $L^1$, define the "truncated
identity function"

$$\varphi(x) = \begin{cases} M & x > M, \\ -M & x < -M, \\ x & \text{otherwise.} \end{cases}$$

Then by the triangle inequality, we can write

$$E[|X_n - X|] \le E[|\varphi(X_n) - \varphi(X)|] + E[|\varphi(X) - X|] + E[|X_n - \varphi(X_n)|].$$

Because $\varphi$ is continuous and uniformly bounded, and $X_n \to X$ in probability, we also have $\varphi(X_n) \to \varphi(X)$ in probability
(for example by using the characterization that "all subsequences have a further subsequence converging almost surely").
Then $E[|\varphi(X_n) - \varphi(X)|] \to 0$ by the (slightly stronger, as in Section 20) bounded convergence theorem, so the first
term goes to zero.

Next, because $|\varphi(X) - X| \le |X| \cdot 1\{|X| \ge M\}$, the second term can be upper bounded by $E[|X|; |X| \ge M]$. Finally,
the third term is bounded by $E[|X_n|; |X_n| \ge M]$. Thus, we can make the right-hand side arbitrarily small by first fixing
$M$ and taking $n \to \infty$ to make the first term small, and then taking $M \to \infty$ (using uniform integrability here) to make
the last two small. Thus $E[|X_n - X|] \to 0$, showing the desired $L^1$ convergence.


To show that (2) $\implies$ (3), we use Jensen's inequality to find that

$$\big| E[|X_n|] - E[|X|] \big| = \big| E[|X_n| - |X|] \big| \le E\big[\big||X_n| - |X|\big|\big] \le E[|X_n - X|]$$

(where the last step is just a property of real numbers). Since the right-hand side goes to zero by assumption, so does
the left-hand side.
Finally, for (3) $\implies$ (1), it suffices to show that

$$\lim_{M \to \infty} \left( \limsup_{n \to \infty} E[|X_n|; |X_n| \ge M] \right) = 0$$

(the usual definition uses sup instead of limsup, but to get to sup, we can deal with finitely many exceptions for $X_n$
by picking a large enough $M$ for each one and then taking the maximum). Consider the function

$$\psi_M(x) = \begin{cases} x & |x| \le M - 1, \\ 0 & |x| \ge M, \\ \text{linear interpolation} & \text{otherwise.} \end{cases}$$

Notice that $\psi_M$ is continuous and that $|X| \cdot 1\{|X| \ge M\} \le |X| - |\psi_M(X)| \le |X| \cdot 1\{|X| \ge M - 1\}$. Thus,

$$\limsup_{n \to \infty} E[|X_n|; |X_n| \ge M] \le \limsup_{n \to \infty} E[|X_n| - |\psi_M(X_n)|].$$

Now using the bounded convergence theorem (because $\psi_M$ is uniformly bounded by $M$) on the second term and
assumption (3) on the first term means that

$$\limsup_{n \to \infty} E[|X_n|; |X_n| \ge M] \le E[|X| - |\psi_M(X)|] \le E[|X|; |X| \ge M - 1].$$

As we send $M \to \infty$, the right-hand side goes to zero (because $X$ is integrable), so the left-hand side does as well,
proving uniform integrability.


We can now apply these results to martingales:

Theorem 189 (Submartingale $L^1$ convergence theorem)

For a submartingale $(X_n)_{n \ge 0}$, the following are equivalent:
1. Uniform integrability,
2. Convergence almost surely and in $L^1$,
3. Convergence in $L^1$.

Proof. To show that (1) $\implies$ (2), we again have that $\sup_n E|X_n| < \infty$ from our earlier observations. Thus (by the
submartingale convergence theorem, Theorem 173), $X_n$ converges almost surely to some limit $X_\infty$, meaning $X_n \to X_\infty$
in probability as well. Thus we can apply Theorem 187 to show that uniform integrability also implies $L^1$ convergence.

Finally, (2) $\implies$ (3) is trivial, and (3) $\implies$ (1) is a consequence of (2) $\implies$ (1) in Theorem 187.


Theorem 190 (Martingale $L^1$ convergence theorem)

For a martingale $(X_n)_{n \ge 0}$, the following are equivalent:
1. Uniform integrability,
2. Convergence almost surely and in $L^1$,
3. Convergence in $L^1$,
4. The existence of a random variable $X \in L^1$ such that $X_n = E[X \mid \mathcal{F}_n]$ for all $n$.

The first three conditions are the same as above (since martingales are submartingales), but this theorem also
tells us that if we have a uniformly integrable martingale, then there is some integrable random variable for which that
martingale is just gradually "exposing information."

Proof. (1), (2), and (3) are equivalent by Theorem 189, and (4) $\implies$ (1) is Lemma 186, so we just need to show
that (3) $\implies$ (4). By repeatedly applying the martingale property, we know that $X_n = E[X_\ell \mid \mathcal{F}_n]$ for all $\ell \ge n$, which
is equivalent to saying that

$$E[X_n; A] = E[X_\ell; A] \quad \text{for all } A \in \mathcal{F}_n \text{ and } \ell \ge n.$$

Taking $\ell \to \infty$, we have $E[X_n; A] = E[X_\infty; A]$ because $X_\ell \to X_\infty$ in $L^1$ (so if $E[|X_\ell - X_\infty|] \to 0$, we also have
$E[|1_A \cdot (X_\ell - X_\infty)|] \to 0$). Because everything is integrable here and we have satisfied the conditional expectation
identity, $E[X_\infty \mid \mathcal{F}_n] = X_n$ for all $n$, and $X_\infty$ is our random variable $X$.


Definition 191 
Let $\mathcal{F}_n$ be a filtration on $(\Omega, \mathcal{F}, P)$. For any $X \in L^1(\Omega, \mathcal{F}, P)$, the sequence $X_n = E[X \mid \mathcal{F}_n]$ is the Doob martingale
of $X$ with respect to $\mathcal{F}_n$.

As mentioned, the idea of such a sequence is that we reveal the randomness of $X$ "a little bit at a time," and what
we've just shown is that any martingale converging in $L^1$ is exactly a Doob martingale for some $X$.
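
As a concrete illustration of "revealing randomness a little bit at a time," the following sketch computes a Doob martingale exactly in a toy setting that is assumed here and not taken from the lecture: four fair coin flips, with $X$ the length of the longest run of heads and $\mathcal{F}_k$ generated by the first $k$ flips. All names (`N_FLIPS`, `longest_head_run`, `doob`) are hypothetical.

```python
from fractions import Fraction
from itertools import product

N_FLIPS = 4  # assumed toy example: four independent fair coin flips

def longest_head_run(flips):
    """X(omega): length of the longest run of 1s (heads) in a tuple of 0/1 flips."""
    best = cur = 0
    for x in flips:
        cur = cur + 1 if x == 1 else 0
        best = max(best, cur)
    return best

def doob(prefix, f=longest_head_run, n=N_FLIPS):
    """X_k = E[f | F_k] on the event that the first k flips equal `prefix`."""
    # Average f over all equally likely completions of the unrevealed flips.
    rest = list(product((0, 1), repeat=n - len(prefix)))
    return Fraction(sum(f(tuple(prefix) + r) for r in rest), len(rest))

# Reveal one particular outcome a flip at a time and watch X_k evolve.
outcome = (1, 1, 0, 1)
for k in range(N_FLIPS + 1):
    print(f"k = {k}: X_k = {doob(outcome[:k])}")
```

At $k = 0$ the value is just $E[X]$, and at $k = 4$ it is $X$ itself; averaging `doob` over the two possible next flips recovers the current value, which is the martingale property.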


Theorem 192 (Lévy convergence theorem) 
Let $\mathcal{F}_n$ be a filtration of $(\Omega, \mathcal{F}, P)$, and define $\mathcal{F}_\infty = \sigma(\bigcup_n \mathcal{F}_n)$. Then for any $X \in L^1$, we have

$$E[X \mid \mathcal{F}_n] \to E[X \mid \mathcal{F}_\infty] \quad \text{almost surely and in } L^1.$$

Remark 193. The union of sigma-algebras is not itself a sigma-algebra, but here we are taking the sigma-algebra of
the union. This is sometimes denoted $\mathcal{F}_n \uparrow \mathcal{F}_\infty$.

This is a statement about the limit of the Doob martingale. We should compare this with Theorem 190: in that
case, we explicitly constructed a random variable $X$ which satisfied the properties we want. But $\mathcal{F}_\infty$ can be strictly
smaller than $\mathcal{F}$, and the Lévy convergence theorem holds even if $X$ is some arbitrary integrable $\mathcal{F}$-measurable function
(even if it's not $\mathcal{F}_\infty$-measurable).


Proof. Define $X_n = E[X \mid \mathcal{F}_n]$. We know that $\{X_n\}_{n \ge 0}$ is a uniformly integrable family by Lemma 186, so it converges
to some $X_\infty$ almost surely and in $L^1$ by the $L^1$ martingale convergence theorem (Theorem 190). So we just need
to show that $X_\infty = E[X \mid \mathcal{F}_\infty]$, which is an exercise with the definition of conditional expectation. We know that
$X_\infty \in \mathcal{F}_\infty$, because it is the almost-sure limit of the $X_n$s, each of which is measurable with respect to $\mathcal{F}_n$ and thus to
$\mathcal{F}_\infty$. For the conditional identity, notice that

$$X_n = E[X \mid \mathcal{F}_n] \implies E[X_n; A] = E[X; A] = E[X_\infty; A] \quad \text{for all } A \in \mathcal{F}_n,$$

where the last equality comes from $X_n = E[X_\infty \mid \mathcal{F}_n]$ (again by the proof of Theorem 190). Applying this for an arbitrary
$n$, we find that $E[X; A] = E[X_\infty; A]$ for all $A \in \bigcup_{n \ge 0} \mathcal{F}_n$. To finish, we extend this to the sigma-algebra $\sigma\left(\bigcup_{n \ge 0} \mathcal{F}_n\right)$
with the standard $\pi$-$\lambda$ argument.


Applying this result to the indicator random variable $1_A$ (for any $A \in \mathcal{F}_\infty$) yields the following useful result:

Corollary 194 (Lévy 0-1 law)
Let $\mathcal{F}_n \uparrow \mathcal{F}_\infty$. Then for any $A \in \mathcal{F}_\infty$, we have $P(A \mid \mathcal{F}_n) \to 1_A$ (so in particular, the conditional probability
converges to either 0 or 1).


We'll finish with a few helpful hints for the homework. The Azuma-Hoeffding bound says that if $X_n$ is a
martingale with bounded increments, then $X_n - E[X_n]$ is well-concentrated. One consequence is the concentration of
the independence number in $G_{n,p}$ (the random graph on $n$ vertices where each pair is connected with probability $p$).
Basically, if $G = (V, E)$ is a graph, then $S \subseteq V$ is an independent set if no two vertices in $S$ are neighbors in $G$. The
independence number $\alpha(G)$ is then the maximal cardinality $|S|$ of such a set (which is some number between 1 and $|V|$).

We are asked to determine the behavior of $X = \alpha(G)$ as a random variable. One useful way to define a filtration is

$$\mathcal{F}_\ell = \sigma(\text{edges restricted to the first } \ell \text{ vertices}).$$

This is known as an edge-revealing filtration, in which we're told at each step how a new vertex is connected to the
currently revealed graph. There are variations on this idea, but the point is that the random variables $E[X \mid \mathcal{F}_\ell]$ have
no integrability issues (because $G$ is a finite graph) and form a Doob martingale, so using the Azuma-Hoeffding bound
will give us a concentration bound! And this kind of strategy also works for the clique number and chromatic number
of a random graph, too. (This general type of argument comes from a paper by Shamir and Spencer [9].) But as we'll
see on our homework, our asymptotic bound will only be good for some values of $p$.
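
To get a feel for the random variable $X = \alpha(G)$ before proving anything, one can sample small graphs and compute the independence number by brute force. The sketch below is only an experiment under assumed parameters ($n = 10$, $p = 0.5$, 200 samples, a fixed seed); the helper names are hypothetical, and the brute-force search is exponential in $n$, so it is not meant for large graphs.

```python
import itertools
import random

def independence_number(n, edges):
    """Brute-force alpha(G) for a graph on vertices 0..n-1 (only sensible for small n)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Try subset sizes from largest to smallest; the first independent set found is maximum.
    for size in range(n, 0, -1):
        for S in itertools.combinations(range(n), size):
            if all(v not in adj[u] for u, v in itertools.combinations(S, 2)):
                return size
    return 0

def sample_gnp(n, p, rng):
    """Sample the edge list of G_{n,p}: each pair is an edge independently with probability p."""
    return [(u, v) for u, v in itertools.combinations(range(n), 2) if rng.random() < p]

rng = random.Random(0)   # fixed seed so the experiment is reproducible
n, p = 10, 0.5           # assumed toy parameters
samples = [independence_number(n, sample_gnp(n, p, rng)) for _ in range(200)]
print("min:", min(samples), "max:", max(samples), "mean:", sum(samples) / len(samples))
```

Revealing one vertex at a time changes $\alpha$ by at most 1, which is the bounded-increment property that the Azuma-Hoeffding bound needs for the Doob martingale described above.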


21 November 25, 2019 


In the last few lectures, we've shown some results about submartingale convergence: specifically, if $X_n$ is a submartingale
with $\sup_n E[(X_n)_+]$ finite, then $X_n \to X_\infty$ converges almost surely with $E[|X_\infty|] < \infty$. Also, uniform integrability,
convergence almost surely and in $L^1$, and convergence in $L^1$ are all equivalent for a submartingale.

Today, we'll talk about the optional stopping theorem. Recall that a stopping time $\tau$ with respect to a filtration
$\mathcal{F}_n$ is a random variable such that $\{\tau = n\} \in \mathcal{F}_n$ for all $n$. As discussed previously, for any submartingale $X_n$ and any
stopping time that is almost surely bounded by $0 \le \tau \le k$, we have $E[X_0] \le E[X_\tau] \le E[X_k]$. More generally, this tells
us that $E[X_0] \le E[X_{\tau \wedge n}] \le E[X_n]$ for any finite $n$, but we can't just replace $n$ with $\infty$ here. So today's discussion will
explain what we can do when we do take this limit.


Lemma 195 
Let $X_n$ be a uniformly integrable submartingale and $\tau$ be any stopping time (potentially with $P(\tau = \infty) > 0$).
Then $E[|X_\tau|] < \infty$.

Proof. Because $X_n$ is uniformly integrable, it converges to some limit $X_\infty$ almost surely, so the limit exists and $X_\tau$
is well-defined (even when $\tau = \infty$). Since $f(x) = \max(x, 0)$ is a nondecreasing convex function, $(X_n)_+$ is also a
submartingale, so we also have $E[(X_{n \wedge \tau})_+] \le E[(X_n)_+]$ for any $n$. The right-hand side is uniformly bounded (by uniform
integrability) by some constant, which means that $\sup_n E[(X_{n \wedge \tau})_+] < \infty$. But $(X_{n \wedge \tau})$ is a stopped process of $(X_n)$, so
it is a submartingale as well. Thus by the submartingale convergence theorem, $X_{n \wedge \tau}$ converges almost surely to a limit
which is in $L^1$. But that limit is exactly $X_\tau$ (whether $\tau$ is finite or infinite).


Corollary 196 


If $X_n$ is a uniformly integrable submartingale, and $\tau$ is any stopping time, then $Y_n = X_{n \wedge \tau}$ is also a uniformly
integrable submartingale.

Proof. We wish to show that

$$\lim_{M \to \infty} \left( \sup_n E[|X_{n \wedge \tau}|; |X_{n \wedge \tau}| \ge M] \right) = 0.$$

We can split up the expectation into two cases, based on whether $n$ or $\tau$ is smaller, to get

$$\sup_n E[|X_{n \wedge \tau}|; |X_{n \wedge \tau}| \ge M] = \sup_n E\big[|X_n| \cdot 1\{\tau > n\} + |X_\tau| \cdot 1\{\tau \le n\}; |X_{n \wedge \tau}| \ge M\big]$$
$$\le \sup_n E[|X_n|; |X_n| \ge M] + E[|X_\tau|; |X_\tau| \ge M].$$

But because the $X_n$ are uniformly integrable, the first term goes to zero as $M \to \infty$, and because $X_\tau \in L^1$ (by
Lemma 195), the second term goes to zero as $M \to \infty$ by the dominated convergence theorem.


Theorem 197 


If $X_n$ is a uniformly integrable submartingale and $\tau$ is a stopping time, then $E[X_0] \le E[X_\tau] \le E[X_\infty]$.

As we will see in the proof, the first inequality $E[X_0] \le E[X_\tau]$ holds even if we only know that the stopped process
$X_{n \wedge \tau}$ is a uniformly integrable submartingale.

Proof. We have previously shown that $E[X_0] \le E[X_{\tau \wedge n}] \le E[X_n]$ for all finite $n$. Since $X_{\tau \wedge n}$ is uniformly integrable
(by Corollary 196), it converges to $X_\tau$ in $L^1$, so $E[X_0] \le E[X_\tau]$.

For the other bound, since $X_n$ is uniformly integrable, $X_n$ converges to $X_\infty$ almost surely and in $L^1$. So $X_{\tau \wedge n}$ and
$X_n$ converge almost surely and in $L^1$ to $X_\tau$ and $X_\infty$, respectively, and because $E[X_{\tau \wedge n}] \le E[X_n]$ for each $n$, taking
limits on both sides yields $E[X_\tau] \le E[X_\infty]$, as desired.


However, uniform integrability can often be difficult to check, so here's an easier criterion:

Theorem 198
Suppose $X_n$ is a submartingale such that $E[|X_{n+1} - X_n| \mid \mathcal{F}_n] \le B$ for all $n$. Then if $\tau$ is a stopping time with
$E[\tau] < \infty$, then $X_{n \wedge \tau}$ is uniformly integrable (so in particular we do have $E[X_0] \le E[X_\tau]$).

Proof. For any $n \ge 0$, we can write

$$|X_{n \wedge \tau}| \le |X_0| + \sum_{i \ge 0} 1\{\tau > i\} \, |X_{i+1} - X_i|$$

by the triangle inequality. Call the right-hand side $Y$ and notice that it does not depend on $n$. Now $1\{\tau > i\}$ is measurable
with respect to $\mathcal{F}_i$, so the expectation of $Y$ can be written (by the tower property) as

$$E[Y] = E[|X_0|] + \sum_{i \ge 0} E\big[E[1\{\tau > i\} \cdot |X_{i+1} - X_i| \mid \mathcal{F}_i]\big]$$
$$= E[|X_0|] + \sum_{i \ge 0} E\big[1\{\tau > i\} \cdot E[|X_{i+1} - X_i| \mid \mathcal{F}_i]\big].$$

But now using our assumption, we can show that $Y$ is integrable:

$$E[Y] \le E[|X_0|] + \sum_{i \ge 0} E[1\{\tau > i\} \cdot B] = E[|X_0|] + B \, E[\tau] < \infty.$$

Because all $|X_{n \wedge \tau}|$s are dominated by the integrable random variable $Y$, we have $E[|X_{n \wedge \tau}|; |X_{n \wedge \tau}| \ge M] \le
E[|Y|; |Y| \ge M]$ for all $n$. Since the right-hand side goes to zero as $M \to \infty$, we get the desired uniform integrability
condition.


Next, we have a result that’s a bit disconnected from the previous ones, in that we don’t even need uniform 


integrability at all: 


Theorem 199 


If $X_n$ is a nonnegative supermartingale, and $\tau$ is any stopping time, then $E[X_0] \ge E[X_\tau]$.

Proof. By Theorem 174, we have $X_n \to X_\infty$ almost surely, so again $X_\tau$ is well-defined. But because $E[X_0] \ge E[X_{n \wedge \tau}]$
for all $n$, Fatou's lemma tells us that

$$E[X_\tau] \le \liminf_{n \to \infty} E[X_{n \wedge \tau}] \le E[X_0],$$

as desired.


We will now generalize Theorem 197 by looking at multiple stopping times at once: 


Definition 200 


If $\tau$ is a stopping time with respect to $\mathcal{F}_n$, then the stopping time $\sigma$-algebra associated to $\tau$ is

$$\mathcal{F}_\tau = \{A \in \mathcal{F} : A \cap \{\tau = n\} \in \mathcal{F}_n \text{ for all } n\}.$$

This definition essentially says that an event $A$ is in the stopping time $\sigma$-algebra if we "know if we're in $A$ at time
$\tau$," since any event in $\mathcal{F}_n$ is "known" at stage $n$. We can check that if $\tau = k$ almost surely, then $\mathcal{F}_\tau = \mathcal{F}_k$, and (with
more work) if we have a process $Y_n \in \mathcal{F}_n$, then $Y_\tau \in \mathcal{F}_\tau$.

Theorem 201

Let $X_{n \wedge \tau}$ be a uniformly integrable submartingale. If $\sigma, \tau$ are both stopping times and $\sigma \le \tau$ almost surely, then

$$E[X_\tau \mid \mathcal{F}_\sigma] \ge X_\sigma.$$

(When $\sigma = 0$, this is very similar to the inequality $E[X_\tau] \ge E[X_0]$ that we've proved above.)

Proof. Since $Y_n = X_{n \wedge \tau}$ is a uniformly integrable submartingale and $\sigma$ is a stopping time, we know that $E[Y_0] \le
E[Y_\sigma] \le E[Y_\infty]$ (by Theorem 197), which we can rewrite as $E[X_0] \le E[X_\sigma] \le E[X_\tau]$ because $\sigma \le \tau$ almost surely.
Now fix $A \in \mathcal{F}_\sigma$ and define the random variable

$$\xi(\omega) = \sigma 1_A + \tau 1_{A^c}.$$

We always have either $\xi = \sigma$ or $\xi = \tau$, and we claim that $\xi$ is a stopping time. Indeed, we can write

$$\{\xi = n\} = (A \cap \{\sigma = n\}) \cup (A^c \cap \{\tau = n\}).$$

Because $A \in \mathcal{F}_\sigma$, the first term $(A \cap \{\sigma = n\})$ is in $\mathcal{F}_n$ (by the definition of the stopping time $\sigma$-algebra). And the
second term can be written as

$$A^c \cap \{\tau = n\} = \bigcup_{k=0}^{n} \big(A^c \cap \{\sigma = k\} \cap \{\tau = n\}\big)$$

(since $\sigma \le \tau$ by assumption), at which point we can notice that $A^c \cap \{\sigma = k\} \in \mathcal{F}_k \subseteq \mathcal{F}_n$ and $\{\tau = n\} \in \mathcal{F}_n$, so the
entire right-hand side is also in $\mathcal{F}_n$. Putting this all together, $\{\xi = n\} \in \mathcal{F}_n$, so we do have a stopping time. Since $\xi$ is
bounded by $\tau$ almost surely, we have $E[Y_\xi] \le E[Y_\infty] \implies E[X_\xi] \le E[X_\tau]$ by the same logic as for $\sigma$. But writing out
the definition of $\xi$, this tells us that

$$E[X_\sigma; A] + E[X_\tau; A^c] \le E[X_\tau] \implies E[X_\sigma; A] \le E[X_\tau; A] = E\big[E[X_\tau \mid \mathcal{F}_\sigma]; A\big]$$

by the tower property. Since $A$ is an arbitrary event in $\mathcal{F}_\sigma$, this shows the desired result.


We can now work with a quantitative example to see the optional stopping theorem in action: 


Example 202 (Asymmetric random walk) 


Consider a random walk on the integers defined by $S_n = \sum_{i=1}^n \xi_i$, where each $\xi_i$ is $1$ with probability $p$ and $-1$
with probability $q = 1 - p$. Assume that $p \in (0.5, 1)$ (so that there is a drift in the upward direction).

We can rewrite $S_n$ as

$$S_n = \sum_{i=1}^n (\xi_i - E[\xi_i]) + n \cdot E[\xi].$$

The first term here is the sum of $n$ iid terms of mean zero, and it has standard deviation on the order of $\sqrt{n}$. But
the second term is of order $n$, so it will dominate for large $n$ and we will eventually stop returning to position 0. To
quantify this, notice that the equation $1 = E[s^{\xi}] = ps + \frac{q}{s}$ has solutions at $s = 1$ and $s = \frac{q}{p}$, so $M_n = \left(\frac{q}{p}\right)^{S_n}$, which is a
product of iid terms of mean 1, will be a nonnegative martingale. Now for any integer $x$, let

$$\tau_x = \inf\{n \ge 0 : S_n = x\}$$

be the first hitting time of $x$, which can be infinite if the walk never reaches $x$. Now fix some $a, b > 0$ and define
$\tau = \tau_{-a} \wedge \tau_b$. The stopped martingale $M_{n \wedge \tau}$ is uniformly bounded (between $\left(\frac{q}{p}\right)^{b}$ and $\left(\frac{q}{p}\right)^{-a}$), so $E[M_\tau] = E[M_0]$.
(Here, we are essentially using Theorem 197 and its analogous result for supermartingales.) Also, $\tau$ is finite almost
surely because the probability of not escaping $(-a, b)$ is exponentially decaying in $n$. But now if we let $\eta$ be the
probability that the walk hits $-a$ before $b$, we have

$$1 = E[M_0] = E[M_\tau] = \eta \left(\frac{q}{p}\right)^{-a} + (1 - \eta) \left(\frac{q}{p}\right)^{b} \implies \eta = \frac{1 - (q/p)^{b}}{(q/p)^{-a} - (q/p)^{b}}.$$

Taking $a \to \infty$, we find that the probability of never hitting $b$ is $P(\tau_b = \infty) = 0$, and taking $b \to \infty$, we find that the
probability of hitting $-a$ is $P(\tau_{-a} < \infty) = \left(\frac{q}{p}\right)^{a}$. So in words, any positive level will be reached almost surely, and any
negative level has a positive probability of never being reached.


Next, we ask about the expected amount of time it will take to hit any $b > 0$. For this, define a new martingale

$$X_n = S_n - n E[\xi] = S_n - n(p - q).$$

Since $S_{\tau_b} = b$, it makes sense to expect that $b - E[\tau_b](p - q) = 0$. To justify this, let $S_{\min}$ be the minimum value of $S_n$
over all $n \ge 0$. This is a nonpositive random variable, and it is integrable because

$$E[|S_{\min}|] = \sum_{a \ge 1} P(|S_{\min}| \ge a) = \sum_{a \ge 1} P(\tau_{-a} < \infty) = \sum_{a \ge 1} \left(\frac{q}{p}\right)^{a} < \infty,$$

because we have a geometric series with common ratio $\frac{q}{p} < 1$. So $S_{\min} \in L^1$, and because $S_{\min} \le S_{n \wedge \tau_b} \le b$ for all
$n$, $S_{n \wedge \tau_b}$ is uniformly integrable (for example because it is dominated by $|S_{\min}| + b$). This means that $S_{n \wedge \tau_b}$ converges
almost surely and in $L^1$ to its limit $S_{\tau_b}$, which is almost surely $b$ by definition. So now turning back to our martingale
$X_n$, this convergence in $L^1$ tells us that (because $0 \le \tau_b \wedge n \le n$)

$$0 = E[X_0] = E[X_{n \wedge \tau_b}] = E[S_{n \wedge \tau_b}] - E[n \wedge \tau_b](p - q) \overset{n \to \infty}{\implies} 0 = b - E[\tau_b](p - q),$$

because $E[n \wedge \tau_b]$ converges to $E[\tau_b]$ by the monotone convergence theorem. Thus $E[\tau_b] = \frac{b}{p - q} = \frac{b}{2p - 1}$.
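
The two formulas from this example, the probability $\eta$ of hitting $-a$ before $b$ and the expected hitting time $E[\tau_b] = \frac{b}{2p - 1}$, are easy to sanity-check by simulation. The following is a rough Monte Carlo sketch under assumed parameters ($p = 0.6$, $a = 2$, $b = 3$, 20000 trials, a fixed seed); the function names are hypothetical, and the estimates only agree with the formulas up to sampling error.

```python
import random

def eta_formula(p, a, b):
    """P(hit -a before b), from optional stopping applied to M_n = (q/p)^{S_n}."""
    r = (1 - p) / p
    return (1 - r**b) / (r**(-a) - r**b)

def hits_lower_first(p, a, b, rng):
    """Run one walk from 0 until it exits (-a, b); return True if it exits at -a."""
    s = 0
    while -a < s < b:
        s += 1 if rng.random() < p else -1
    return s == -a

def hitting_time(p, b, rng):
    """Number of steps until the walk first reaches b (finite a.s. when p > 1/2)."""
    s = steps = 0
    while s < b:
        s += 1 if rng.random() < p else -1
        steps += 1
    return steps

rng = random.Random(2019)            # fixed seed for reproducibility
p, a, b, trials = 0.6, 2, 3, 20000   # assumed toy parameters
eta_hat = sum(hits_lower_first(p, a, b, rng) for _ in range(trials)) / trials
tau_hat = sum(hitting_time(p, b, rng) for _ in range(trials)) / trials

print("eta: simulated", round(eta_hat, 4), "formula", round(eta_formula(p, a, b), 4))
print("E[tau_b]: simulated", round(tau_hat, 3), "formula", b / (2 * p - 1))
```

Note that the simulation of `hitting_time` relies on $p > 1/2$; for $p \le 1/2$ the expected hitting time is infinite and the loop may run for a very long time.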


Example 203 (Patterns in a random string) 
Sample $(\sigma_1, \sigma_2, \cdots)$ iid from the alphabet $\mathcal{A} = \{A, B, \cdots, Z\}$ (here $|\mathcal{A}| = 26$). Let $\tau$ be the first time we see a
particular sequence of letters $w$, which we'll take to be "ABRACADABRA." We wish to study $E[\tau]$.

We have $\tau \ge \text{length}(w) = 11$, and we can check that $E[\tau] < \infty$ (because within any block of time there is a
positive chance to see the word, so we have geometric decay). So $\tau$ is indeed integrable. We construct a martingale as
follows: suppose that before each time $n$, a new gambler $G_n$ enters and bets 1 dollar on the event $\{\sigma_n = w_1\}$. If $\sigma_n \ne w_1$, then
$G_n$ loses and exits; otherwise $G_n$ wins \$26. In the latter case, $G_n$ bets all of the money on the event $\{\sigma_{n+1} = w_2\}$, either
losing it or winning \$$26^2$. This betting continues for $G_n$ until there is a mismatch (that is, the event $\{\sigma_{n+j-1} \ne w_j\}$),
or the entire string $w$ is correctly predicted, in which case the gambler wins \$$26^{11}$.

Now, let $M_n$ be the total winnings up to time $n$ from the point of view of the casino. This is integrable for any $n$
(since any gambler's winnings are bounded), and all games are fair, so $M_n$ is a martingale (with respect to $\mathcal{F}_n$, which
encodes both the sequence and the bets). Also, $E[|M_{n+1} - M_n| \mid \mathcal{F}_n]$ is uniformly bounded almost surely, because only
the 11 most recent gamblers can be making bets (in particular we have a bound of $26^{11} \cdot 11$). Thus by Theorem 198,
if $\tau$ is the stopping time where some gambler wins for the first time,

$$0 = E[M_0] = E[M_\tau] = E[\tau - 26^{11} - 26^{4} - 26],$$

because when the game stops, the casino has won one dollar from each of the $\tau$ gamblers, but at the stopping point,
there are three gamblers who have won $26$, $26^4$, $26^{11}$ dollars respectively (the ones whose bets so far match the suffixes
"A", "ABRA", and "ABRACADABRA"). So $E[\tau] = 26^{11} + 26^{4} + 26$ for this example
(and this generalizes to any sequence $w$), giving us our answer.
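
The generalization to an arbitrary word can be phrased as: a gambler is still "alive" at the stopping time exactly when a prefix of $w$ equals a suffix of $w$, and such a gambler holds (alphabet size)$^k$ dollars, where $k$ is the length of that prefix. A minimal sketch of this bookkeeping, with a hypothetical function name and assuming letters are drawn uniformly from the alphabet:

```python
def expected_wait(word, alphabet_size):
    """Expected first-occurrence time of `word` in an iid uniform letter stream.

    Sums alphabet_size**k over every k such that the length-k prefix of `word`
    equals its length-k suffix (k = len(word) always qualifies).
    """
    return sum(
        alphabet_size ** k
        for k in range(1, len(word) + 1)
        if word[:k] == word[-k:]
    )

print(expected_wait("ABRACADABRA", 26))
print(expected_wait("HH", 2), expected_wait("HT", 2))
```

For fair coin flips this reproduces the classical fact that "HH" takes longer on average to appear than "HT", even though both patterns are equally likely at any fixed position.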


Next lecture, we'll move on to discussing reverse martingales and their applications to zero-one laws. 
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We'll study reverse martingales today, starting with two different (equivalent) definitions: 


Definition 204 


A reverse martingale is a martingale "indexed by the nonpositive integers $\mathbb{Z}_{\le 0}$." In other words, a reverse
martingale is a sequence of random variables $(M_n)_{n \le 0}$ such that (1) $M_n \in L^1(\mathcal{F}_n)$ for all $n$, where $\cdots \subseteq \mathcal{F}_{-2} \subseteq
\mathcal{F}_{-1} \subseteq \mathcal{F}_0 \subseteq \mathcal{F}$, and (2) we have the usual martingale property $M_n = E[M_{n+1} \mid \mathcal{F}_n]$.

Equivalently, we can index a reverse martingale in the usual way $(M_n)_{n \ge 0}$, but now we require $\mathcal{F} \supseteq \mathcal{F}_0 \supseteq \mathcal{F}_1 \supseteq
\mathcal{F}_2 \supseteq \cdots$, and our martingale property is now $E[M_n \mid \mathcal{F}_{n+1}] = M_{n+1}$. We'll use this latter notation.

Notice that if we have this nesting of $\sigma$-fields $\mathcal{F} \supseteq \mathcal{F}_0 \supseteq \cdots$, then $\mathcal{F}_n \downarrow \bigcap_{n \ge 0} \mathcal{F}_n$, which is itself a $\sigma$-field which
we call $\mathcal{F}_\infty$. (So unlike with ordinary martingales, the limit of the $\sigma$-fields is also a $\sigma$-field because we're taking
intersections instead of unions.)

The theory of reverse martingales is actually easier than normal martingales: for example, we have $M_n = E[M_0 \mid \mathcal{F}_n]$
for all $n$, so we have "less and less" information as we progress.


Theorem 205 
If $(M_n)_{n \ge 0}$ is a reverse martingale with respect to a filtration $\mathcal{F}_n \downarrow \mathcal{F}_\infty$, then $M_n \to M_\infty = E[M_0 \mid \mathcal{F}_\infty]$ almost
surely and in $L^1$.

Proof. Let $U_n(a, b)$ be the number of upcrossings of $[a, b]$ traversed by the process $(M_n, \cdots, M_0)$ (this is going
backwards in time, so this is a normal martingale). By Doob's upcrossing inequality, we have $E[U_n(a, b)] \le \frac{E[(M_0 - a)_+]}{b - a}$.
But the right-hand side is independent of $n$ and finite (because $M_0$ is integrable), so $E[U_n(a, b)]$ is bounded. Thus
$E[U_\infty(a, b)] < \infty$ by the monotone convergence theorem, so by the same logic as in Theorem 173, $M_n$ must converge
almost surely to $M_\infty$. Because $M_n = E[M_0 \mid \mathcal{F}_n]$ for all $n$, $(M_n)_{n \ge 0}$ is a collection of conditional expectations of $M_0$
and is thus uniformly integrable. Convergence in $L^1$ thus follows from Theorem 187.

To check that the limit is indeed $E[M_0 \mid \mathcal{F}_\infty]$, notice that for any $A \in \mathcal{F}_\infty \subseteq \mathcal{F}_n$, we have $E[M_0; A] = E[M_n; A]$
because $M_n = E[M_0 \mid \mathcal{F}_n]$, and then convergence $M_n \to M_\infty$ in $L^1$ implies that $\lim_{n \to \infty} E[M_n; A] = E[M_\infty; A]$ as well.
So $E[M_0; A] = E[M_\infty; A]$ for all $A \in \mathcal{F}_\infty$, which is the conditional expectation identity for $M_\infty = E[M_0 \mid \mathcal{F}_\infty]$.


Corollary 206 
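For any random variable $Y \in L^1(\Omega, \mathcal{F}, P)$ and any $\sigma$-fields $\mathcal{F}_n \downarrow \mathcal{F}_\infty$ (where $\mathcal{F}_n \subseteq \mathcal{F}$ for all $n$), $E[Y \mid \mathcal{F}_n]$ converges
to $E[Y \mid \mathcal{F}_\infty]$ almost surely and in $L^1$.

Proof. $M_n = E[Y \mid \mathcal{F}_n]$ is a reverse martingale, so it converges almost surely and in $L^1$ to $M_\infty = E[M_0 \mid \mathcal{F}_\infty] =
E[E[Y \mid \mathcal{F}_0] \mid \mathcal{F}_\infty]$, which is indeed $E[Y \mid \mathcal{F}_\infty]$ by the tower property.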


We'll spend the rest of the lecture on various zero-one laws, which tell us that certain types of events in a 


probability space must have probability either 0 or 1.


Definition 207 
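Let $(X_j)_{j \ge 1}$ be independent (but not necessarily iid) random variables on a probability space $(\Omega, \mathcal{F}, P)$, and define
$\mathcal{F}_{n+} = \sigma(X_n, X_{n+1}, \cdots)$ for all $n$. The tail $\sigma$-algebra of $(X_j)$ is the $\sigma$-field $\mathcal{T} = \bigcap_{n \ge 1} \mathcal{F}_{n+}$.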


Example 208 
The event $A = \{X_n > 3 \text{ i.o.}\}$ is in the tail sigma-algebra of the $X_n$s, because it does not depend on the value of
the first $k$ random variables for any $k$.

Formally, we can prove this by writing

$$A = \bigcap_{n \ge 1} \bigcup_{m \ge n} \{X_m > 3\}$$

(in words, $X_n > 3$ occurring infinitely often is the same as having $X_m > 3$ for some $m \ge n$ no matter what $n$ we pick).
But because the inner union is decreasing as a function of $n$, we can rewrite

$$A = \bigcap_{n \ge \ell} \bigcup_{m \ge n} \{X_m > 3\}$$

for any $\ell$. But this event is now in $\mathcal{F}_{\ell+}$ for all $\ell$ (because it doesn't depend on $X_1, \cdots, X_{\ell-1}$), so $A$ is also in the
intersection $\mathcal{T}$.


Theorem 209 (Kolmogorov 0-1 law) 
Let $(X_j)_{j \ge 1}$ be independent random variables on $(\Omega, \mathcal{F}, P)$, and let $\mathcal{T}$ be their tail $\sigma$-algebra. Then $\mathcal{T}$ is trivial
under $P$, meaning that $P(A) = 0$ or $1$ for all $A \in \mathcal{T}$.

In words, if an event doesn't change when we change only finitely many coordinates, it either always happens or
never happens.


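Proof without martingales. Define $\mathcal{F}_k = \sigma(X_1, \cdots, X_k)$ and $\mathcal{F}_{m,n} = \sigma(X_m, \cdots, X_n)$. Because all of the $X_j$s are
independent, $\mathcal{F}_k$ is independent of $\mathcal{F}_{m,n}$ as long as $k < m \le n$. Now fix $k < m$, and notice that $\bigcup_{n \ge m} \mathcal{F}_{m,n}$ is a
$\pi$-system generating the $\sigma$-algebra $\mathcal{F}_{m+}$, so $\mathcal{F}_k$ is independent of $\mathcal{F}_{m+}$ by a $\pi$-$\lambda$ argument (since the collection of
events independent of $\mathcal{F}_k$ is a $\lambda$-system, and all of the $\pi$-system $\bigcup_{n \ge m} \mathcal{F}_{m,n}$ is independent of $\mathcal{F}_k$).

Next, $\mathcal{F}_{m+}$ contains the tail sigma-algebra, so $\mathcal{F}_k \perp \mathcal{T}$ for all $k$. Now $\bigcup_k \mathcal{F}_k$ is a $\pi$-system (with all elements
independent of $\mathcal{T}$) which generates $\mathcal{F}_{1+}$, so $\mathcal{F}_{1+}$ is independent of $\mathcal{T}$ by another $\pi$-$\lambda$ argument. But $\mathcal{F}_{1+}$ contains
$\mathcal{T}$, so this really implies that $\mathcal{T}$ is independent of itself. Thus, for all $A \in \mathcal{T}$,

$$P(A) = P(A \cap A) = P(A)^2,$$

meaning that $P(A) = 0$ or $1$ for all $A$, as desired.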


Proof with forward martingales. We may argue as above to show that $\mathcal{F}_k$ is independent of $\mathcal{T}$ for all $k$. For any $A \in \mathcal{T}$,
consider the martingale $M_n = E[1_A \mid \mathcal{F}_n]$. By Theorem 192, $M_n$ converges almost surely and in $L^1$ to $E[1_A \mid \mathcal{F}_\infty] = 1_A$
(because $\mathcal{T} \subseteq \mathcal{F}_\infty$, so $1_A \in \mathcal{F}_\infty$). But $A$ is independent of $\mathcal{F}_n$ for all $n$, so $M_n = E[1_A] = P(A)$ for all $n$ and thus we
also have $M_\infty = P(A)$. Setting the two limits equal, $P(A) = 1_A$ almost surely, so the probability can only be 0 or 1.


The Kolmogorov 0-1 law is stated in terms of random variables, but we can also state it slightly differently. Suppose
our probability space has a product structure

$$(\Omega, \mathcal{F}, P) = \left( \prod_{i=1}^{\infty} S_i, \ \bigotimes_{i=1}^{\infty} \mathcal{S}_i, \ \bigotimes_{i=1}^{\infty} \mu_i \right),$$

and define the random variables $X_i : \Omega \to S_i$ via

$$X_i(\omega = (\omega_j)_{j=1}^{\infty}) = \omega_i$$

(that is, returning the $i$th coordinate of $\omega$), so that the law of $X_i$ is exactly $\mu_i$. The next result requires the $X_i$s to be
iid, but it will be stated similarly to the previous setting:


Definition 210 
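Suppose we have a product space $(\Omega, \mathcal{F}, P) = \left( \prod_{i=1}^{\infty} S, \bigotimes_{i=1}^{\infty} \mathcal{S}, \bigotimes_{i=1}^{\infty} \mu \right)$ (equivalently, a sequence of iid random
variables). The exchangeable $\sigma$-algebra $\mathcal{E}$ is the set of events invariant under permutation of finitely many
coordinates (that is, a bijection $\pi : \mathbb{N} \to \mathbb{N}$ with fixed points at all but finitely many coordinates). Formally, for
any $\omega \in \Omega = S^{\mathbb{N}}$, define $\omega_\pi = (\omega_{\pi(i)})_{i=1}^{\infty}$, and for any $A \in \mathcal{F}$, define $A_\pi = \{\omega_\pi : \omega \in A\}$. Call $A$ permutable if
$A = A_\pi$ for all finite permutations $\pi$, and let $\mathcal{E} = \{A : A \text{ permutable}\}$.

The set of events invariant under permutations of the first $n$ coordinates is often written

$$\mathcal{E}_n = \{A \in \mathcal{F} : A = A_\pi \text{ for all } \pi : \mathbb{N} \to \mathbb{N} \text{ with } \pi(i) = i \ \forall i > n\}.$$

In particular, notice that $\mathcal{E}_n \downarrow \mathcal{E}$.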


Example 211 

We have $\mathcal{T} \subseteq \mathcal{E}$, because any event in the tail $\sigma$-algebra is invariant under exchange of the first $n$ coordinates
for any $n$ and thus invariant under all permutations of finitely many coordinates. However, $\mathcal{T} \ne \mathcal{E}$: for example,
$A = \{\sum_{i=1}^{n} X_i \in [9, 10] \text{ i.o.}\}$ is exchangeable but not in $\mathcal{T}$ in general.


Theorem 212 (Hewitt-Savage 0-1 law) 
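On a product space $(\Omega, \mathcal{F}, P) = \left( \prod_{i=1}^{\infty} S, \bigotimes_{i=1}^{\infty} \mathcal{S}, \bigotimes_{i=1}^{\infty} \mu \right)$, the exchangeable $\sigma$-field $\mathcal{E}$ is trivial under $P$ (meaning
that any event has probability 0 or 1).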


Proof without martingales. We start with a useful fact: 


Fact 213 
On any probability space $(\Omega, \mathcal{F}, P)$, if $\mathcal{F}_i \uparrow \mathcal{F}_\infty \subseteq \mathcal{F}$, then for any $A \in \mathcal{F}_\infty$, we can find $A_i \in \mathcal{F}_i$ such that
$P(A \Delta A_i) \to 0$ (where $\Delta$ is the symmetric difference).

(This can be proven by a $\pi$-$\lambda$ argument, because the set of $A$ for which this is true forms a $\lambda$-system, and $\bigcup_i \mathcal{F}_i$
is a $\pi$-system.) We will apply this to our product space by defining (for all $n$)

$$\mathcal{F}_n = \sigma\left\{ B_1 \times \cdots \times B_n \times \prod_{i > n} S \ : \ B_i \in \mathcal{S} \right\}.$$

We have $\mathcal{F}_n \uparrow \mathcal{F}$ by definition of the product $\sigma$-field, so for all $A \in \mathcal{F}$, we can find $A_n \in \mathcal{F}_n$ such that $P(A \Delta A_n) \to 0$
(by Fact 213). Now define $\pi_n : \mathbb{N} \to \mathbb{N}$ to swap $k$ and $n + k$ for all $1 \le k \le n$ and keep the other coordinates fixed,
and let $B_n = (A_n)_{\pi_n}$. For any event $A \in \mathcal{E}$, we have

$$P(A \Delta B_n) = P((A \Delta B_n)_{\pi_n}) = P(A \Delta A_n),$$

where the first equality comes from the measure $P$ being invariant under permutations of indices (because it is a
product iid measure), and the second equality comes from $A$ being permutable (we can check the details ourselves).
On the other hand, $A_n$ depends only on the first $n$ coordinates, while $B_n$ depends on coordinates $n + 1$ through $2n$, so
$A_n \perp B_n$. Thus,

$$P(A_n \cap B_n) = P(A_n) P(B_n) \implies P(A) = P(A)^2,$$

because $P(A_n), P(B_n), P(A_n \cap B_n) \to P(A)$ if the symmetric differences go to zero.


Proof with reverse martingales. Again, we start with a useful result: 


Lemma 214 
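If $\mathcal{G}, \mathcal{H}$ are any two sub-$\sigma$-fields of $\mathcal{F}$, and $W$ is a random variable with $W \perp \mathcal{H}$ and $E[W \mid \mathcal{G}] \in \mathcal{H}$, then
$E[W \mid \mathcal{G}] = E[W]$ is constant.

Proof of lemma. The conditional expectation $Y = E[W \mid \mathcal{G}]$ is $\mathcal{G}$-measurable by definition and also $\mathcal{H}$-measurable by
assumption. For any $\varepsilon > 0$, define the event

$$A = \{Y - E[W] \ge \varepsilon\} \in \mathcal{G} \cap \mathcal{H}.$$

Since $A \in \mathcal{G}$, the definition of conditional expectation tells us that

$$E[W; A] = E[Y; A] \ge (E[W] + \varepsilon) \cdot P(A).$$

But we also have $A \in \mathcal{H}$, and $W$ is independent of $\mathcal{H}$, so we also have

$$E[W; A] = E[W] P(A).$$

Putting these together, $E[W] P(A) \ge (E[W] + \varepsilon) P(A)$, which is a contradiction unless $P(A) = 0$. We can similarly
show that $Y - E[W] \le -\varepsilon$ with probability zero for any $\varepsilon > 0$. Thus we indeed have $Y = E[W]$ almost surely, as desired.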


Turning to the proof, we will show that $\mathcal{E} \perp \mathcal{F}_k$ for all $k$ (with $\mathcal{F}_k$ as defined in the previous proof), which implies
that $\mathcal{E} \perp \mathcal{F}$ by a $\pi$-$\lambda$ argument. Since $\mathcal{F}$ contains $\mathcal{E}$, this shows that $\mathcal{E}$ is independent of itself, so $P(A)^2 = P(A)$ for
all $A \in \mathcal{E}$ just like before.

To show that $\mathcal{E} \perp \mathcal{F}_k$, it suffices to show that for any bounded measurable function $\varphi : S^k \to \mathbb{R}$, if we let
$W = \varphi(X_1, \cdots, X_k) \in \mathcal{F}_k$, then we have $E[W \mid \mathcal{E}] = E[W]$. (This is indeed sufficient because we can choose $W$ to be
the indicator for an arbitrary event in $\mathcal{F}_k$.) Because $\mathcal{E}_n \downarrow \mathcal{E}$, we know that $E[W \mid \mathcal{E}_n] \to E[W \mid \mathcal{E}]$ almost surely and in
$L^1$ by Corollary 206. Now for an alternate way of studying $E[W \mid \mathcal{E}_n]$, consider the random variable $A_n(\varphi)$ (for $n \ge k$)
defined by

$$A_n(\varphi) = \frac{1}{(n)_k} \sum_{i \in [n]_k} \varphi(X_{i_1}, \cdots, X_{i_k}),$$

in which we take an average of $\varphi$ over all permutations of the indices $\{1, \cdots, n\}$ (and use the first $k$ of them). Here
$[n]_k$ is the set of $k$-tuples $i = (i_1, \cdots, i_k)$ of distinct indices in $\{1, \cdots, n\}$, and $(n)_k = n(n-1)\cdots(n-k+1)$ is the
number of such tuples. Because of this averaging, permuting the first $n$ indices does not change the value of $A_n(\varphi)$,
so $A_n(\varphi) \in \mathcal{E}_n$ and thus

$$A_n(\varphi) = E[A_n(\varphi) \mid \mathcal{E}_n] = \frac{1}{(n)_k} \sum_{i \in [n]_k} E[\varphi(X_{i_1}, \cdots, X_{i_k}) \mid \mathcal{E}_n].$$

But because the different $X_j$s are iid, all of these conditional expectations are the same, so this reduces to
$E[\varphi(X_1, \cdots, X_k) \mid \mathcal{E}_n] = E[W \mid \mathcal{E}_n]$. Thus we can study convergence of $A_n(\varphi)$ instead of $E[W \mid \mathcal{E}_n]$ directly. We claim
that $\lim_{n \to \infty} E[W \mid \mathcal{E}_n] \in \mathcal{F}_{(k+1)+}$. Indeed, consider the terms in $A_n(\varphi)$ as $n \to \infty$. The probability that a uniformly
chosen tuple $(i_1, \cdots, i_k)$ has any intersection with $(1, \cdots, k)$ goes to 0 as $n \to \infty$ (it's at most $\frac{k^2}{n}$ by a union bound).
So the limit does not depend on $X_1, \cdots, X_k$ (since the total contribution to the average from terms including any of
$X_1, \cdots, X_k$ is then at most $\frac{k^2}{n} \sup|\varphi|$, which goes to zero as $n \to \infty$), and thus $\lim_{n \to \infty} E[W \mid \mathcal{E}_n] \in \mathcal{F}_{(k+1)+}$.

However, remembering that this limit is also $E[W \mid \mathcal{E}]$, we can apply Lemma 214 with $\mathcal{G} = \mathcal{E}$ and $\mathcal{H} = \mathcal{F}_{(k+1)+}$.
Specifically, $W$ is indeed independent of $\mathcal{F}_{(k+1)+}$ by construction (because it is $\mathcal{F}_k$-measurable), and $E[W \mid \mathcal{E}] \in \mathcal{F}_{(k+1)+}$,
so $E[W \mid \mathcal{E}] = E[W]$, finishing the proof.


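For a final application of the reverse martingale, recall the strong law of large numbers, which says that for any iid random variables $X, X_i$ with $E[X] = \mu$, we have $\frac{1}{n} \sum_{i=1}^n X_i \to \mu$ almost surely. Here's a shorter proof than the one we last showed:

Alternate proof of Theorem 88. Let $S_n = X_1 + \dots + X_n$ be the usual partial sum, and define $\mathcal{G}_n = \sigma(S_n, S_{n+1}, \dots)$ and $\mathcal{G}_n \downarrow \mathcal{G}_\infty$ (as usual, this is a σ-algebra because it's the intersection of σ-algebras). Define $M_n = \frac{S_n}{n}$, and notice that

$$E[M_n|\mathcal{G}_{n+1}] = E\left[\frac{S_n}{n} \,\middle|\, S_{n+1}, S_{n+2}, S_{n+3}, \dots\right] = E\left[\frac{S_n}{n} \,\middle|\, S_{n+1}, X_{n+2}, X_{n+3}, \dots\right].$$

But if we know the value of $S_{n+1}$, also knowing $X_{n+2}, X_{n+3}, \dots$ does not give us any additional information about $S_n$, so

$$E[M_n|\mathcal{G}_{n+1}] = E\left[\frac{S_n}{n} \,\middle|\, S_{n+1}\right] = \frac{S_{n+1}}{n+1} = M_{n+1},$$

where the middle equality holds by symmetry: $E[X_i|S_{n+1}]$ is the same for every $i \le n+1$ and these sum to $S_{n+1}$, so each equals $\frac{S_{n+1}}{n+1}$.

So $M_n$ is a reverse martingale with respect to $\mathcal{G}_n$, meaning that $M_n$ converges almost surely to $M_\infty = E[M_1|\mathcal{G}_\infty] = E[X_1|\mathcal{G}_\infty]$ by Theorem 205. But $\mathcal{G}_n \subseteq \mathcal{E}_n$ for all $n$ (because the values of $S_n, S_{n+1}, \dots$ don't change when we permute $X_1, \dots, X_n$), so $\mathcal{G}_\infty \subseteq \mathcal{E}$. So by Theorem 212, $E[X_1|\mathcal{G}_\infty]$ is a conditional expectation under a σ-algebra in which all probabilities are 0 or 1, so it must be $E[X_1] = \mu$ almost surely. This is the desired result.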
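The reverse martingale property rests on the symmetry identity $E[X_1|S_n] = \frac{S_n}{n}$. As a sanity check, here is a minimal sketch that verifies it by exact enumeration for a small discrete law. The three-point distribution, the function name, and the choice $n = 4$ are assumptions made purely for illustration.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_of_first(values, probs, n):
    """Return {s: E[X_1 | S_n = s]} for iid X_i taking values[j] with probability probs[j]."""
    numerator = {}    # s -> sum over outcomes with S_n = s of x_1 * P(outcome)
    denominator = {}  # s -> P(S_n = s)
    for outcome in product(range(len(values)), repeat=n):
        prob = Fraction(1)
        for j in outcome:
            prob *= probs[j]
        xs = [values[j] for j in outcome]
        s = sum(xs)
        numerator[s] = numerator.get(s, Fraction(0)) + xs[0] * prob
        denominator[s] = denominator.get(s, Fraction(0)) + prob
    return {s: numerator[s] / denominator[s] for s in denominator}

# Hypothetical example law on {0, 1, 3}; any iid law would do.
values = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = 4
cond = conditional_mean_of_first(values, probs, n)
```

For every attainable value $s$ of $S_n$, `cond[s]` should equal `Fraction(s, n)` exactly, which is the statement $E[X_1|S_n] = S_n/n$ in this finite setting.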
Today is our last lecture before the second exam, so we'll cover some new material and have quite a bit of review. (There is no class on Wednesday; our exam is in the evening instead.)

Recall that if we have two finite measures $\mu, \nu$ on $(\Omega, \mathcal{F})$, we say that $\nu$ is absolutely continuous with respect to $\mu$ (denoted $\nu \ll \mu$) if for all $A \in \mathcal{F}$ with $\mu(A) = 0$, we also have $\nu(A) = 0$. Then a measurable function $f : \Omega \to [0, \infty)$ satisfying

$$\nu(A) = \int_A f \, d\mu = \int f(\omega) 1\{\omega \in A\} \, d\mu(\omega)$$

for all $A \in \mathcal{F}$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ and denoted $f = \frac{d\nu}{d\mu}$. (An example of such a derivative from our proof of Cramér's theorem is the exponential tilting, whose density with respect to the original law is proportional to $e^{\theta x}$.) The Radon-Nikodym derivative doesn't exist for all $\mu, \nu$, but if it does, $\nu$ must be absolutely continuous with respect to $\mu$, because given any set $A$ with $\mu(A) = 0$, $\nu(A) = \int_A f \, d\mu = 0$. It turns out that the converse is also true (we stated this as Theorem 152 earlier in the course):


Theorem 215 (Radon-Nikodym) 


If $\mu, \nu$ are σ-finite measures on $(\Omega, \mathcal{F})$ and $\nu \ll \mu$, then there is a measurable function $f$ such that $\nu(A) = \int_A f \, d\mu$ for all $A \in \mathcal{F}$.


Notice that $f$ is unique $\mu$-almost-surely, because otherwise there would be some set $A$ of positive measure such that the integrals over $A$ differ (if $f_1$ and $f_2$ were two different Radon-Nikodym derivatives, consider $A = \{f_1 - f_2 > \varepsilon\}$ or $A = \{f_1 - f_2 < -\varepsilon\}$). Earlier in the class, we used this theorem (without proof) to construct conditional expectation. Specifically, if $X \in L^1(\Omega, \mathcal{F}, P)$ and we have a sub-σ-algebra $\mathcal{G} \subseteq \mathcal{F}$, then we can assume without loss of generality that $X$ is nonnegative and define $E[X|\mathcal{G}] = \frac{d\nu}{d\mu}$, where $\mu = P|_{\mathcal{G}}$ and $\nu(A) = E[X; A]$ for all $A \in \mathcal{G}$. We can do this because $\nu$ is absolutely continuous with respect to $\mu$ (if $A$ has measure zero, then $E[X; A] = 0$ because any simple function has integral zero over $A$). This conditional expectation is $\mathcal{G}$-measurable and satisfies the conditional expectation property by plugging in the definitions of $\mu$ and $\nu$ into $\nu(A) = \int_A f \, d\mu$.


To start, we'll prove a few basic properties of absolutely continuous measures: 


Lemma 216
If we have finite measures $\pi \ll \nu \ll \mu$ on $(\Omega, \mathcal{F})$, then $\frac{d\pi}{d\mu} = \frac{d\pi}{d\nu} \cdot \frac{d\nu}{d\mu}$ $\mu$-almost-surely.

Proof. Notice that we do indeed have $\pi \ll \mu$, because $\mu(A) = 0 \implies \nu(A) = 0 \implies \pi(A) = 0$. We claim that for all nonnegative measurable functions $h$, we have

$$\int h \, d\nu = \int h \frac{d\nu}{d\mu} \, d\mu.$$

Indeed, this identity holds for any indicator $h = 1_A$ (by definition of the Radon-Nikodym derivative), so it holds for simple functions by linearity, and then we can approximate from below and use the monotone convergence theorem in general. Next, because $\pi$ is absolutely continuous with respect to $\nu$, we have

$$\pi(A) = \int_A \frac{d\pi}{d\nu} \, d\nu$$

for any event $A$. But $1_A \frac{d\pi}{d\nu}$ is a nonnegative measurable function, so by the claim above, we have

$$\pi(A) = \int_A \frac{d\pi}{d\nu} \frac{d\nu}{d\mu} \, d\mu.$$

Because the Radon-Nikodym derivative is unique almost-surely under $\mu$, $\frac{d\pi}{d\nu} \frac{d\nu}{d\mu}$ must agree almost surely with $\frac{d\pi}{d\mu}$, as desired.


Plugging in $\mu$ for $\pi$, or doing so while swapping the roles of $\mu$ and $\nu$, yields the following result:

Lemma 217
If $\nu \ll \mu$ and $\mu \ll \nu$, then $\frac{d\mu}{d\nu} = \left(\frac{d\nu}{d\mu}\right)^{-1}$ almost surely under both $\mu$ and $\nu$.


The full proof of Radon-Nikodym can be found in Appendix A.4 of Durrett, and it's a bit outside the scope of the class. However, there is one case which is easy to prove, which we'll discuss now. Suppose we can partition our probability space into countably many pieces $\Omega = \bigsqcup_{i=1}^\infty \Omega_i$, and $\mathcal{F} = \sigma(\Omega_1, \Omega_2, \dots)$. Then for any finite measures $\mu, \nu$ on $\Omega$ with $\nu \ll \mu$, we can define

$$\frac{d\nu}{d\mu}(\omega) = \begin{cases} \frac{\nu(\Omega_i)}{\mu(\Omega_i)} & \text{when } \omega \in \Omega_i \text{ with } \mu(\Omega_i) > 0, \\ 0 & \text{when } \omega \in \Omega_i \text{ with } \mu(\Omega_i) = 0. \end{cases}$$

(We can replace 0 in the second case with any other number: that case occurs with probability zero, so it doesn't contribute to any relevant calculations.) We can indeed check both properties of the Radon-Nikodym derivative in this simple case, and thus we prove the theorem when we have a countable partition.
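To make the partition case concrete, here is a minimal sketch that builds $\frac{d\nu}{d\mu}$ block by block and checks the defining property $\nu(A) = \int_A f \, d\mu$ on a union of blocks. The four-block masses and the function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def rn_derivative_on_partition(nu_masses, mu_masses):
    """Return the value of d(nu)/d(mu) on each block Omega_i of a partition.

    nu_masses[i] and mu_masses[i] are nu(Omega_i) and mu(Omega_i).
    """
    derivative = []
    for nu_i, mu_i in zip(nu_masses, mu_masses):
        if mu_i > 0:
            derivative.append(nu_i / mu_i)
        else:
            # Absolute continuity forces nu(Omega_i) = 0 on mu-null blocks.
            if nu_i != 0:
                raise ValueError("nu is not absolutely continuous with respect to mu")
            derivative.append(Fraction(0))  # arbitrary value on a null block
    return derivative

def integrate(f_values, mu_masses, blocks):
    """Integrate a block-wise constant function against mu over a union of blocks."""
    return sum(f_values[i] * mu_masses[i] for i in blocks)

# Hypothetical masses on a four-block partition (the last block is mu-null).
mu = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0)]
nu = [Fraction(1, 8), Fraction(3, 8), Fraction(1, 2), Fraction(0)]
f = rn_derivative_on_partition(nu, mu)
```

Here `integrate(f, mu, [0, 2])` recovers `nu[0] + nu[2]`, and the same holds for any other union of blocks, which is every set in $\mathcal{F}$.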

More generally, suppose our σ-algebra can be approximated with sets of the above form. In other words, consider a space $(\Omega, \mathcal{F})$, where $\mathcal{F}_n \uparrow \mathcal{F}$ and each $\mathcal{F}_n$ is generated by a countable partition of $\Omega$. For example, if we take $\Omega = \mathbb{R}$ and $\mathcal{F}_n = \sigma\left(\left[\frac{k}{2^n}, \frac{k+1}{2^n}\right) : k \in \mathbb{Z}\right)$, then $\mathcal{F}_n$ increases to the entire Borel σ-algebra $\mathcal{B}_{\mathbb{R}}$ (because the Borel σ-algebra is generated by open intervals, and we can approximate open intervals with a countable union of these dyadic sets).

So if we have two finite measures $\mu, \nu$ on $(\Omega, \mathcal{F})$ such that $\nu \ll \mu$, we can define $\mu_n$ (resp. $\nu_n$) to be the restriction of $\mu$ (resp. $\nu$) to $\mathcal{F}_n$. We have $\nu_n \ll \mu_n$ for all $n$, so $\frac{d\nu_n}{d\mu_n}$ exists by the above explicit construction, and it's natural to ask whether $\frac{d\nu_n}{d\mu_n} \to \frac{d\nu}{d\mu}$.


Remark 218. It's possible to have $\mu_n \ll \nu_n$ for all $n$ but not $\mu \ll \nu$. For example, take $\mu$ to be the infinite product of Bernoulli($p$) measures, and take $\nu$ to be the infinite product of Bernoulli($q$) measures (for $p \ne q$ in $(0, 1)$), with $\mathcal{F}_n$ generated by the first $n$ coordinates. Then we do have $\mu_n \ll \nu_n$ for any $n$, but by the strong law of large numbers we do not have $\mu \ll \nu$ (the event that $\frac{S_n}{n} \to p$ has probability zero in $\nu$ but not in $\mu$).
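The Bernoulli example can be seen numerically through the likelihood ratio $\frac{d\mu_n}{d\nu_n}$, which is a finite product of per-coordinate ratios. The sketch below evaluates its logarithm along one sequence sampled from $\nu$; the parameter values, the seed, and the function name are assumptions for illustration only.

```python
import math
import random

def log_likelihood_ratio(samples, p, q):
    """Return log(d mu_n / d nu_n) at a 0/1 sequence, for mu = Bernoulli(p) and nu = Bernoulli(q) products."""
    total = 0.0
    for x in samples:
        if x == 1:
            total += math.log(p / q)
        else:
            total += math.log((1 - p) / (1 - q))
    return total

p, q, n = 0.5, 0.3, 2000      # hypothetical parameters with p != q
rng = random.Random(0)        # fixed seed so the run is reproducible
nu_sample = [1 if rng.random() < q else 0 for _ in range(n)]
log_ratio = log_likelihood_ratio(nu_sample, p, q)
```

Under $\nu$ the log-ratio has negative drift (minus the relative entropy of Bernoulli($q$) with respect to Bernoulli($p$) per coordinate), so `log_ratio` is strongly negative and the ratio itself is essentially zero, even though every finite-$n$ ratio is a positive, finite number.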


Lemma 219
Let $\mu, \nu$ be probability measures on $(\Omega, \mathcal{F})$ with $\mathcal{F}_n \uparrow \mathcal{F}$, and let $\mu_n = \mu|_{\mathcal{F}_n}$, $\nu_n = \nu|_{\mathcal{F}_n}$ for all $n$. Suppose that $\nu_n \ll \mu_n$ for all $n$, so $X_n = \frac{d\nu_n}{d\mu_n}$ is well-defined. (However, we're not assuming that $\nu \ll \mu$.) Then $X_n$ is a martingale with respect to $\mathcal{F}_n$ on $(\Omega, \mathcal{F}, \mu)$.

Proof. Let $E$ denote expectation with respect to $\mu$. Because $X_n$ depends only on $\nu_n$ and $\mu_n$, it is in $\mathcal{F}_n$ by definition, so

$$E[X_n] = \int \frac{d\nu_n}{d\mu_n} \, d\mu = \int \frac{d\nu_n}{d\mu_n} \, d\mu_n = 1$$

(middle equality because $\frac{d\nu_n}{d\mu_n} \in \mathcal{F}_n$, so restricting $\mu$ to $\mu_n$ is sufficient, and last equality by the Radon-Nikodym property for $A = \Omega$). Thus $X_n \in L^1(\mathcal{F}_n)$, showing integrability. For the conditional expectation property, let $A$ be any event in $\mathcal{F}_n$. We have

$$E[X_{n+1}; A] = \int_A X_{n+1} \, d\mu = \int_A X_{n+1} \, d\mu_{n+1} = \nu_{n+1}(A),$$

where the middle equality comes from $X_{n+1}$ being in $\mathcal{F}_{n+1}$ and the last equality comes from the Radon-Nikodym definition. But $\nu_{n+1}(A) = \nu(A) = \nu_n(A)$ (all of these measures are restrictions of $\nu$), so

$$E[X_{n+1}; A] = \nu_n(A) = \int_A X_n \, d\mu_n = E[X_n; A],$$

showing the conditional expectation property and verifying that we have a martingale.


With this, we can answer our convergence question:

Theorem 220
Let $\mu, \nu$ be probability measures on $(\Omega, \mathcal{F})$ (again not assuming absolute continuity), such that $\mathcal{F}_n \uparrow \mathcal{F}$, and again define $\mu_n = \mu|_{\mathcal{F}_n}$ and $\nu_n = \nu|_{\mathcal{F}_n}$ for all $n$. If $\nu_n \ll \mu_n$ with $X_n = \frac{d\nu_n}{d\mu_n}$ for all $n$, then $X = \limsup_{n \to \infty} X_n$ satisfies

$$\nu(A) = \int_A X \, d\mu + \nu(A \cap \{X = \infty\}).$$

The first term $\int_A X \, d\mu$ is also called $\nu_{\text{continuous}}(A)$ (because it is absolutely continuous with respect to $\mu$), while the second term $\nu(A \cap \{X = \infty\})$ is called $\nu_{\text{singular}}(A)$ (meaning its support is disjoint from that of $\mu$). Basically, $X_n$ is a nonnegative martingale under the measure $\mu$, so it converges almost surely to a finite limit under $\mu$. However, it may not do so under the measure $\nu$, explaining the "boundary" second term. (Specifically, even though $\mu(X = \infty) = 0$, we may have $\nu(X = \infty) > 0$.)

Once we prove this result, we'll have proved the Radon-Nikodym theorem in the special case where we have countable partitions $\mathcal{F}_n \uparrow \mathcal{F}$ increasing to the full σ-algebra (since $\nu_{\text{singular}}$ is identically zero if we already know that $\nu \ll \mu$, because then $\nu(X = \infty) = 0$).


Proof. The key idea is to introduce a measure that interpolates between $\mu$ and $\nu$. Let $\rho = \frac{\mu + \nu}{2}$ and let $\rho_n = \rho|_{\mathcal{F}_n} = \frac{\mu_n + \nu_n}{2}$ for all $n$. By assumption, $\nu_n \ll \mu_n \ll \rho_n$ (because $\rho_n$ is an average of $\mu_n$ and $\nu_n$, so any event with measure zero under $\rho_n$ must have measure zero under both $\mu_n$ and $\nu_n$), so the Radon-Nikodym derivatives

$$X_n = \frac{d\nu_n}{d\mu_n}, \quad Y_n = \frac{d\mu_n}{d\rho_n}, \quad Z_n = \frac{d\nu_n}{d\rho_n}$$

are well-defined. We have $Y_n + Z_n = 2$ $\rho_n$-almost-surely, and by Lemma 219, $Y_n$ and $Z_n$ are $\mathcal{F}_n$-martingales on $(\Omega, \mathcal{F}, \rho)$. Furthermore, because they are nonnegative, they converge to finite limits $Y, Z$ $\rho$-almost-surely, and thus $Y + Z = 2$ $\rho$-almost-surely. (In particular, $Y$ and $Z$ are both bounded, so we have strong convergence results.) We claim that $Y = \frac{d\mu}{d\rho}$. Indeed, fix some positive integer $\ell$; for all $n \ge \ell$ and any $A \in \mathcal{F}_\ell \subseteq \mathcal{F}_n$, we have (by the definition of $Y_n$, which is in $\mathcal{F}_n$)

$$\mu(A) = \mu_n(A) = \int_A Y_n \, d\rho_n = \int_A Y_n \, d\rho,$$

which converges to $\int_A Y \, d\rho$ by the bounded convergence theorem (notice that this last step doesn't work for $X$). Thus, for any $A \in \mathcal{F}_\ell$, we have $\mu(A) = \int_A Y \, d\rho$, so by a π-λ argument we have $\mu(A) = \int_A Y \, d\rho$ for all $A \in \mathcal{F} = \sigma\left(\bigcup_\ell \mathcal{F}_\ell\right)$, so $Y = \frac{d\mu}{d\rho}$. Similarly, $Z = \frac{d\nu}{d\rho}$. Now $\nu_n \ll \mu_n \ll \rho_n$, so by Lemma 216 we have $\frac{d\nu_n}{d\rho_n} = \frac{d\nu_n}{d\mu_n} \cdot \frac{d\mu_n}{d\rho_n}$ almost surely; that is, we have $Z_n = X_n Y_n$ almost surely, so $X_n = \frac{Z_n}{Y_n} \in [0, \infty]$ (where $X_n = 0$ only if $Z_n = 0$, and $X_n = \infty$ only if $Y_n = 0$). Thus, $X = \limsup_{n \to \infty} X_n = \frac{Z}{Y}$ (we have limsup instead of lim in case of division-by-zero issues).

Turning back to the statement we are trying to prove, we know that $\nu \ll \rho$ with Radon-Nikodym derivative $Z$, so we have

$$\nu(A) = \int_A Z \, d\rho = \int_A Z \left(WY + 1\{Y = 0\}\right) d\rho = \int_A ZWY \, d\rho + \int_A Z 1\{Y = 0\} \, d\rho,$$

where $W = \frac{1}{Y}$ when $Y > 0$ and 0 otherwise. (We break up the integral in this way because $X = \infty$ corresponds to $Y = 0$, and we're trying to get $X$ into the integral.) But now the first term is $\int_A XY \, d\rho = \int_A X \frac{d\mu}{d\rho} \, d\rho = \int_A X \, d\mu$, because $ZW = X$ on the event that $Y$ is positive (when $Y = 0$ there is no contribution to the integral), and we don't need to worry about the event $\{Y = 0\} = \{X = \infty\}$ in this term now that we're integrating under $\mu$-measure because $X$ is finite $\mu$-almost-surely. And because $Z = \frac{d\nu}{d\rho}$, the second term simplifies to $\int_A 1\{Y = 0\} \, d\nu = \int_A 1\{X = \infty\} \, d\nu$, which is indeed $\nu(A \cap \{X = \infty\})$, as desired.

24 December 9, 2019


There won't be any more assignments for this class — in this last week, we'll cover a few topics of general interest. 
Today, we'll go over the proof of the Kesten-Stigum L log L criterion — the initial work comes from various papers by 
Kesten and Stigum (including [5]), and many of the ideas covered in this lecture come from [7]. 

We'll start with a review of the setup: say that $\xi \sim p$, where $p$ is a probability distribution supported on $\mathbb{Z}_{\ge 0}$ ($\xi$ is the $L$ in the theorem statement). Define an array $\xi_{n,i}$ of iid random variables with law $p$, and define the Galton-Watson tree by starting from a root vertex $o$ (at depth 0) and letting the $i$th vertex at depth $(n - 1)$ have $\xi_{n,i}$ children at depth $n$. This generates a (finite or infinite) tree, and $\mathcal{F}_n = \sigma(\xi_{\ell,i} : i \ge 1, 1 \le \ell \le n)$ encodes all the information about the tree up to depth $n$.

Let $Z_n$ be the population of the tree at depth $n$, which satisfies the recursive equation $Z_n = \sum_{i=1}^{Z_{n-1}} \xi_{n,i}$. If we assume that $E[\xi] = m \in (1, \infty)$ (so we're in the supercritical case), then we know from previous discussion that the probability of extinction (that is, the probability that $Z_n = 0$ eventually) is some $\rho \in [0, 1)$, meaning that there is a positive probability that the population never dies out.

To understand the behavior of $Z_n$ in more detail, we know that $W_n = \frac{Z_n}{m^n}$ is a nonnegative martingale, so it converges almost surely to a limit $W \in [0, \infty)$. Because $P(W = 0)$ satisfies the same fixed point equation as the extinction probability, it is either $\rho$ or 1. Then if $P(W = 0) = 1$, then $Z_n = o(m^n)$ almost surely, and if $P(W = 0) = \rho$, we have $P(W = 0 \mid \text{nonextinction}) = 0$ (because whenever $Z_n = 0$ eventually, we also have $W = 0$), so $W$ is positive and finite almost surely on nonextinction. So conditioned on nonextinction, $Z_n \asymp m^n$ and we have exponential growth.
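As a quick illustration of the martingale $W_n = Z_n / m^n$, here is a small simulation sketch. The offspring law on $\{0, 1, 2\}$, the seed, and the function name are hypothetical choices; for this particular law the fixed point equation $\rho = 0.2 + 0.3\rho + 0.5\rho^2$ gives extinction probability $\rho = 0.4$.

```python
import random

def simulate_generations(offspring_probs, depth, rng):
    """Return [Z_0, ..., Z_depth] for one Galton-Watson tree.

    offspring_probs[k] is the probability that a vertex has k children.
    """
    support = range(len(offspring_probs))
    sizes = [1]
    for _ in range(depth):
        current = sizes[-1]
        if current == 0:
            sizes.append(0)  # extinction is absorbing
            continue
        children = rng.choices(support, weights=offspring_probs, k=current)
        sizes.append(sum(children))
    return sizes

offspring_probs = [0.2, 0.3, 0.5]   # hypothetical law p on {0, 1, 2}
m = sum(k * pk for k, pk in enumerate(offspring_probs))   # mean offspring, here 1.3
rng = random.Random(1)
depth, trials = 8, 4000
w_values = []
for _ in range(trials):
    sizes = simulate_generations(offspring_probs, depth, rng)
    w_values.append(sizes[depth] / m ** depth)
mean_w = sum(w_values) / trials
extinct_fraction = sum(1 for w in w_values if w == 0) / trials
```

The empirical mean `mean_w` should be close to $E[W_n] = 1$, and `extinct_fraction` estimates the probability of dying out by depth 8, which is slightly below $\rho$.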


Kesten-Stigum then tells us when each of these cases occurs (this is a restatement of Theorem 183):

Theorem 221 (Kesten-Stigum)
Suppose $(Z_n)_{n \ge 0}$ is a supercritical Galton-Watson tree with offspring law $\xi \sim p$, where $E[\xi] = m \in (1, \infty)$. Let $W_n = \frac{Z_n}{m^n}$ converge to $W$ almost surely. Then the following are equivalent:

• $P(W = 0) = \rho$,

• $E[W] = 1$,

• $E[\xi \log^+ \xi]$ is finite (where $\log^+(0) = 0$ but otherwise we have the usual log).


On our homework, we showed that if $E[\xi^2] < \infty$, then the martingale $W_n$ is bounded uniformly in $L^2$ norm, so the $L^2$ martingale convergence theorem says that $W_n$ converges in $L^2$ to $W$, and hence also in $L^1$. This means $E[W] = 1$, so we can't have $P(W = 0) = 1$ and thus must have $P(W = 0) = \rho$. This theorem is a stronger version of that result.

One fact we'll use in the proof of Kesten-Stigum also comes from our homework: we showed that if $Y, Y_i$ are iid nonnegative random variables, then

$$\limsup_{n \to \infty} \frac{Y_n}{n} = \begin{cases} 0 & \text{if } E[Y] < \infty, \\ \infty & \text{if } E[Y] = \infty. \end{cases}$$

(The proof is that for any $x > 0$, $\sum_{n=1}^\infty P\left(\frac{Y_n}{n} > x\right)$ is sandwiched between $E\left[\frac{Y}{x}\right] - 1$ and $E\left[\frac{Y}{x}\right]$, so it is either finite for every $x$ or infinite for every $x$, and we can use Borel-Cantelli in either case.) As an easy consequence, we also have that if $Y, Y_i$ are nonnegative integer valued random variables, then for any $c > 1$,

$$\limsup_{n \to \infty} \frac{Y_n}{c^n} = \begin{cases} 0 & \text{if } E[\log^+ Y] < \infty, \\ \infty & \text{if } E[\log^+ Y] = \infty. \end{cases}$$

(This is the same proof as before but rewriting $P\left(\frac{Y_n}{c^n} > x\right) = P\left(\log_c\left(\frac{Y_n}{x}\right) > n\right)$; we have to deal with the case $Y_n = 0$ separately, but that changes the expectation by at most 1.) This allows us to define a new process:
Definition 222
In a branching process with immigration, the population $X_n$ at level $n$ is given by

$$X_n = \sum_{i=1}^{X_{n-1}} \xi_{n,i} + Y_n,$$

where $\xi_{n,i}$ are iid as before and $Y, Y_i$ are iid (corresponding to the additional immigrating population).

Fact 223
On our homework, we showed that if $E[\log^+ Y] = \infty$, then $\limsup_{n \to \infty} \frac{X_n}{c^n}$ is infinite for all $c > 1$, and if $E[\log^+ Y] < \infty$, then $\frac{X_n}{m^n}$ converges to a finite limit in $(0, \infty)$. The first case is easy by the previous discussion, and the second can be shown with the submartingale convergence theorem.


We'll spend the rest of today linking these two results together. One key component comes from an exam question and was also proved in last lecture as Theorem 220: basically, if $\mu, \nu$ are probability measures on $(\Omega, \mathcal{F})$, and we have a filtration $\mathcal{F}_n \uparrow \mathcal{F}$, then we can write $\mu_n = \mu|_{\mathcal{F}_n}$ and $\nu_n = \nu|_{\mathcal{F}_n}$. The result is that if $\nu_n \ll \mu_n$ for all $n$, then we can write

$$\nu(A) = \int_A W \, d\mu + \nu(A \cap \{W = \infty\}) = \nu_{\text{continuous}}(A) + \nu_{\text{singular}}(A),$$

where $W_n = \frac{d\nu_n}{d\mu_n}$ and $W = \limsup W_n$. There are two extreme cases of this: having $W = 0$ $\mu$-almost-surely is the same as having $\nu_{\text{continuous}} = 0$ and $\nu = \nu_{\text{singular}}$ singular with respect to $\mu$, which is the same as having $\nu(W = \infty) = 1$ by setting $A = \Omega$. On the other hand, if $\int W \, d\mu = 1$, meaning that $\nu_{\text{continuous}}(\Omega) = 1$, that's equivalent to having absolute continuity $\nu \ll \mu$ and $\nu(W = \infty) = 0$.

To prove Theorem 221, we'll apply this decomposition result on $\Omega = \{\text{rooted graphs}\}$ (which contains the rooted trees) and $\mathcal{F}_n = \sigma(\text{trees up to depth } n)$. $\Omega$ can be made into a Polish space under the isomorphism distance (also on our homework), and $\mathcal{F}_n$ increases to $\mathcal{F}$, the Borel sigma-algebra on the metric space. The idea is to let $\mu$ be the Galton-Watson measure with offspring law $p$ (meaning that we just generate a Galton-Watson tree in the ordinary way under $\mu$) and define $\nu$ so that $\frac{d\nu_n}{d\mu_n} = W_n = \frac{Z_n}{m^n}$; we'll then proceed by looking at this process under the measure $\nu$. In principle, $\nu$ is already well-defined, since we know $\mu_n$ (and thus $\nu_n$) for every $n$. But $\nu$ itself is a bit confusing, and we don't really know how to analyze a statement like $\nu(W = \infty) = 0$ yet. So we'll do some preliminary work:

• First of all, if $T_n$ is the Galton-Watson tree $T$ truncated at depth $n$, then for any event $B \in \mathcal{F}_n$,

$$\nu(T_n \in B) = \nu_n(T_n \in B) = \int 1\{T_n \in B\} \frac{Z_n}{m^n} \, d\mu_n$$

by the definition of the Radon-Nikodym derivative. In particular, letting $B$ be the event that $T$ dies by level $n$, this integral is just $\int 0 \, d\mu_n = 0$ (because $Z_n = 0$ on $B$). So the tree survives forever under $\nu$ (because $\nu(\text{extinction}) = 0$); this is not true under $\mu$.


• We can define $\nu_1$ explicitly: we have (remembering that $m = E[\xi]$)

$$\frac{d\nu_1}{d\mu_1} = \frac{\xi_{1,1}}{m},$$

meaning that the original measure is biased by the number of offspring at the first level. In other words, $\nu_1$ is the law of $T_1$ with a size bias, and the probability of having $k$ offspring is $\hat{p}_k = \frac{k p_k}{m}$ for any $k \ge 1$. More generally, here's a description of $\nu_n$: let $\nu_n^*$ be a probability measure on pairs $(T_n, x_n)$, where $T_n$ is a tree of depth $n$ and $x_n$ is a vertex at depth $n$ in $T_n$ (meaning that we pick out a distinguished vertex at level $n$), given by

$$\nu_n^*(T_n, x_n) = \frac{\mu_n(T_n)}{\text{normalization}},$$

where the normalization factor is

$$\sum_{(T_n, x_n)} \mu_n(T_n) = \sum_{T_n} \mu_n(T_n) \sum_{x_n} 1 = \sum_{T_n} \mu_n(T_n) Z_n(T_n) = m^n,$$

because this is the expression for the expected number of vertices at depth $n$, $E[Z_n]$, under the normal Galton-Watson measure. Now if $\nu_n$ is the marginal of $\nu_n^*$ on $T_n$ (meaning that we forget which distinguished vertex we choose), we have

$$\nu_n(T_n) = \sum_{x_n} \nu_n^*(T_n, x_n) = \sum_{x_n} \frac{\mu_n(T_n)}{m^n} = \mu_n(T_n) \frac{Z_n(T_n)}{m^n}.$$

In particular, this means that $\frac{d\nu_n}{d\mu_n} = \frac{Z_n}{m^n} = W_n$, so this $\nu_n$ is indeed the $\nu_n$ that we've been looking for: $\nu_n$ can be obtained from $\mu_n$ by sampling a tree with a marked vertex and then forgetting the position of that vertex.
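The size-biased law $\hat{p}_k = \frac{k p_k}{m}$ is easy to compute directly. The following sketch does so with exact rational arithmetic; the offspring law on $\{0, 1, 2\}$ and the function name are hypothetical and serve only as an example.

```python
from fractions import Fraction

def size_biased(offspring_probs):
    """Return (m, p_hat), where m is the mean of p and p_hat[k] = k * p[k] / m."""
    m = sum(k * pk for k, pk in enumerate(offspring_probs))
    p_hat = [k * pk / m for k, pk in enumerate(offspring_probs)]
    return m, p_hat

# Hypothetical offspring law p on {0, 1, 2}.
p = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]
m, p_hat = size_biased(p)
```

In this example $m = \frac{13}{10}$ and $\hat{p} = \left(0, \frac{3}{13}, \frac{10}{13}\right)$. The weights sum to 1, and $\hat{p}_0 = 0$, which is the reason a size-biased vertex always has at least one child.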


Nothing here seems to actually help yet (we still don't have a way to describe $\nu$), but the key observation is the following: we can sample $(T_n, x_n) \sim \nu_n^*$ for each $n$ (which we can think of as $T_n$ with a (unique) marked path $\zeta_n$ from the root $o$ to $x_n$). We can also sample a tree $(T, \zeta)$ with a marked ray in the following way:

• Start from the root and have it generate $\hat{\xi} \sim \hat{p}$ children (from the size-biased law). Choose one of them uniformly at random, $x_1$, to be on the marked path.

• For all other children except $x_1$, generate children according to the original law $p$.

• Generate children for $x_1$ according to the size-biased law $\hat{p}$, and choose one of them, $x_2$, to be on the marked path.

• Repeat this indefinitely, generating children according to $p$ for any vertex not on the marked path and according to $\hat{p}$ for any vertex on it.
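The sampling procedure above can be sketched at the level of generation sizes, without storing the tree itself: at each level we only track how many vertices lie off the marked path. The offspring law, seed, and function name below are assumptions for illustration.

```python
import random

def sample_marked_tree_level_sizes(offspring_probs, depth, rng):
    """Return [Z_0, ..., Z_depth] for a tree built by the marked-ray procedure.

    Vertices off the marked path reproduce with law p (offspring_probs);
    the single marked vertex at each level reproduces with the size-biased law.
    """
    support = list(range(len(offspring_probs)))
    m = sum(k * pk for k, pk in enumerate(offspring_probs))
    size_biased = [k * pk / m for k, pk in enumerate(offspring_probs)]
    unmarked = 0        # vertices at the current level that are off the marked path
    sizes = [1]         # level 0 consists of the (marked) root only
    for _ in range(depth):
        from_unmarked = 0
        if unmarked > 0:
            from_unmarked = sum(rng.choices(support, weights=offspring_probs, k=unmarked))
        # The marked vertex has at least one child; one child continues the marked path.
        marked_children = rng.choices(support, weights=size_biased, k=1)[0]
        unmarked = from_unmarked + (marked_children - 1)
        sizes.append(unmarked + 1)
    return sizes

offspring_probs = [0.2, 0.3, 0.5]   # hypothetical law p on {0, 1, 2}
rng = random.Random(2)
runs = [sample_marked_tree_level_sizes(offspring_probs, 10, rng) for _ in range(200)]
```

Every level of every run contains at least one vertex (the marked one), so the tree built this way never dies out, in contrast with an ordinary Galton-Watson tree with the same $p$.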


We can continue this process indefinitely, producing an infinite tree with a marked ray. Importantly, this is guaranteed to be an infinite tree because the size-biased law is supported on the positive integers, so an $x_i$ will always exist. The infinite tree $(T, \zeta)$ has some law $\nu^*$, and we claim that this agrees in law with $(T_n, x_n)$; in other words, we claim that $\nu^*|_{\mathcal{F}_n} = \nu_n^*$ for all $n$, where the right hand side is defined (as above) to be $\frac{\mu_n(T_n)}{m^n}$. Write the left-hand side as $\nu^*|_{\mathcal{F}_n} = \tilde{\nu}_n^*$. By our sampling process, we have

$$\tilde{\nu}_n^*(T_n, \zeta_n) = \frac{k p_k}{m} \cdot \frac{1}{k} \cdot \prod_{x \ne x_1} \mu_{n-1}(T^{(x)}) \cdot \tilde{\nu}_{n-1}^*(T^{(x_1)}, \zeta^{(x_1)}),$$

where $k$ is the number of children of the root and $T^{(x)}$ is the subtree rooted at a depth-1 vertex $x$, because we generate $k$ children for the root vertex under the size-bias law, pick one of them at random, construct a regular Galton-Watson tree rooted at $x$ for all children $x \ne x_1$, and perform the special sampling procedure up to level $(n - 1)$ for the marked vertex $x_1$. We can compare this recursion to the one for $\nu_n^*$, which is

$$\nu_n^*(T_n, \zeta_n) = \frac{\mu_n(T_n)}{m^n} = \frac{p_k}{m^n} \prod_{x \text{ at depth } 1} \mu_{n-1}(T^{(x)}),$$

because other than the $m^n$ factor, we're just doing the normal Galton-Watson sampling. This can be rewritten as

$$\nu_n^*(T_n, \zeta_n) = \frac{p_k}{m} \prod_{x \ne x_1} \mu_{n-1}(T^{(x)}) \cdot \frac{\mu_{n-1}(T^{(x_1)})}{m^{n-1}}.$$

Noticing that this is the same recursion as for $\tilde{\nu}_n^*$, and the two measures agree at $n = 1$, induction tells us that $\tilde{\nu}_n^* = \nu_n^*$ for all $n$. So now we have a measure $\nu^*$ on $(T, \zeta)$, and we can just forget $\zeta$ by letting $\nu$ be the marginal of $\nu^*$ on $T$. Then $\nu$ restricted to $\mathcal{F}_n$ is the marginal of $\nu^*|_{\mathcal{F}_n}$, which is the marginal of $\nu_n^*$, which is $\nu_n$ by our explicit calculations above, so we have the desired $\nu|_{\mathcal{F}_n} = \nu_n$.

Now note that if $(T, \zeta) \sim \nu^*$, then $T \setminus \zeta$ is a branching process with immigration because the unmarked children coming off of $\zeta$ (that is, the siblings of the marked children) are "immigrating" into the tree as $Y_1, Y_2, \dots$. Specifically, if $X_n$ is the number of vertices at level $n$ of the tree except for the marked ray, we have

$$X_n = \sum_{i=1}^{X_{n-1}} \xi_{n,i} + Y_n,$$

where the $Y_n$s are iid with law $\hat{\xi} - 1$ (since we have the size-biased distribution but don't count the marked vertex). And by Fact 223, $\limsup \frac{X_n}{m^n}$ depends on $E[\log^+ Y]$, and

$$E[\log^+ Y] = E[\log^+(\hat{\xi} - 1)] = \frac{E[\xi \log^+(\xi - 1)]}{m},$$

which is finite if and only if $E[\xi \log^+ \xi]$ is finite! So we can now put everything together: $\Omega$ is the space of rooted graphs, $\mathcal{F}_n$ are the sigma-algebras of the tree up to level $n$, $\mu$ is the Galton-Watson law with offspring law $\xi \sim p$, $\nu^*$ is the law on pairs $(T, \zeta)$ as defined above, and $\nu$ is the marginal of $\nu^*$ on $T$. For all $n$, $\mu_n, \nu_n$ are the restrictions of $\mu, \nu$ to $\mathcal{F}_n$, and $\frac{d\nu_n}{d\mu_n} = W_n = \frac{Z_n}{m^n}$. And now if $(T, \zeta) \sim \nu^*$, then $T \setminus \zeta$ is a branching process with immigration law $Y = \hat{\xi} - 1$, so

$$E[\xi \log^+ \xi] < \infty \implies E[\log^+ Y] < \infty \implies \nu^*\left(\lim_{n \to \infty} \frac{X_n}{m^n} < \infty\right) = 1$$

by Fact 223. Since $X_n$ is the number of vertices at the $n$th level except for one of the vertices, we can replace $\frac{X_n}{m^n}$ with $W_n = \frac{Z_n}{m^n} = \frac{X_n + 1}{m^n}$ (because the extra $\frac{1}{m^n}$ is negligible). And if an event occurs with probability 1 under $\nu^*$, it also occurs with probability 1 under the marginal $\nu$. Thus we can continue the chain of statements and write

$$E[\xi \log^+ \xi] < \infty \implies \nu\left(\lim_{n \to \infty} W_n < \infty\right) = 1 \implies \nu \ll \mu \text{ and } \int W \, d\mu = 1,$$

where the last step comes from the "extreme decomposition case" we mentioned above where $\nu(W = \infty) = 0$. So if $E[\xi \log^+ \xi]$ is finite, then $E[W] = 1$; and because $P(W = 0)$ is either $\rho$ or 1, but the latter case would yield $\int W \, d\mu = 0$, we also get $P(W = 0) = \rho$. And finally, writing a similar chain in terms of the notation from before (but now assuming infinite $E[\xi \log^+ \xi]$), we have

$$E[\xi \log^+ \xi] = \infty \implies E[\log^+ Y] = \infty \implies \nu^*\left(\limsup_{n \to \infty} \frac{X_n}{m^n} = \infty\right) = \nu\left(\limsup_{n \to \infty} W_n = \infty\right) = 1.$$

But $W = \limsup W_n$ being infinite $\nu$-almost surely corresponds to the other extreme case where $\nu$ is singular with respect to $\mu$, meaning $W = 0$ $\mu$-almost surely, so $P(W = 0) = 1 \ne \rho$ and $E[W]$ cannot be 1. Therefore $E[\xi \log^+ \xi]$ being finite is equivalent to both $P(W = 0) = \rho$ and $E[W] = 1$, as desired.

25 December 11, 2019


Today, we'll cover the $\zeta(2)$ limit in the random assignment problem. We'll use some of the techniques we've seen in this class, and the results are very beautiful! There are two formulations of the problem that are equivalent:

Problem 224 (Version 1)
Let $C \in \mathbb{R}^{n \times n}$ be a random matrix with iid entries each with distribution $n \cdot \text{Exp}$ (so in particular the entries have mean $n$). Define

$$A_n = \min_{\pi \in S_n} \frac{1}{n} \sum_{i=1}^n C_{i,\pi(i)}.$$

In other words, choose a rook-placement with minimum total sum of entries, and look at the average of those entries. We wish to study $E[A_n]$.

Problem 225 (Version 2)
Consider a complete bipartite graph $G_{n,n}$ (where we have $n$ vertices on the left and right and we draw all $n^2$ edges between the left and the right). Think of the vertices on the left side as indexing rows and vertices on the right side as indexing columns, so $C_{ij}$ is the weight of the edge connecting $i$ on the left to $j$ on the right. Letting each edge have weight $n \cdot \text{Exp}$ as before, we can define

$$A_n = \min_M \frac{1}{n} \sum_{(i,j) \in M} C_{ij},$$

where we take the minimum over perfect matchings $M$ (which are a subset of edges such that every vertex is adjacent to exactly one of those edges), and again we want to study $E[A_n]$.
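For very small $n$, the quantity $A_n$ can be computed by brute force over all $n!$ rook placements. The sketch below does this for a fixed, hand-picked $3 \times 3$ cost matrix (not a random one); the matrix and the function name are hypothetical.

```python
from itertools import permutations

def min_average_assignment(cost):
    """Return (A_n, best permutation) by checking all n! rook placements."""
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total / n, best_perm

# Hypothetical deterministic cost matrix, used only to illustrate the definition.
cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.5, 5.0],
    [3.0, 2.0, 2.0],
]
a_n, perm = min_average_assignment(cost)
```

Here the best placement matches row 0 to column 1, row 1 to column 0, and row 2 to column 2, with total cost 5 and average $\frac{5}{3}$. Brute force is only feasible for tiny $n$; it is meant to pin down the definition rather than to be an efficient solver.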


Here's the result (from [2]) that we'll be covering today:

Theorem 226 (Aldous)
In the setting above, we have $\lim_{n \to \infty} E[A_n] = \sum_{k=1}^\infty \frac{1}{k^2} = \zeta(2) = \frac{\pi^2}{6} \approx 1.645$.

Fact 227
This answer was first conjectured in 1987 by Mézard and Parisi, and previous work had showed that

$$1 + \frac{1}{e} < 1.51 < \liminf E[A_n] \le \limsup E[A_n] < 1.94 < 2.$$

Various "famous people" worked on these bounds, so many people cared about the answer to this problem. The fact that the liminf and limsup are equal came from earlier work by Aldous in [1], and further work (see [6] and [8]) showed $E[A_n] = \sum_{k=1}^n \frac{1}{k^2}$. Today, we'll just show the original proof that the limit is $\frac{\pi^2}{6}$, but improvements (see [11]) have further simplified the proof since then.
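The finite-$n$ formula $E[A_n] = \sum_{k=1}^n \frac{1}{k^2}$ quoted above makes the convergence to $\zeta(2)$ easy to see numerically. The following is a small sketch; the function name is our own.

```python
import math

def expected_min_cost(n):
    """Evaluate the finite-n formula E[A_n] = sum_{k=1}^n 1/k^2."""
    return sum(1.0 / (k * k) for k in range(1, n + 1))

zeta_2 = math.pi ** 2 / 6
gap_at_1000 = zeta_2 - expected_min_cost(1000)
```

Since the tail $\sum_{k > n} \frac{1}{k^2}$ lies between $\frac{1}{n+1}$ and $\frac{1}{n}$, the gap to $\frac{\pi^2}{6}$ shrinks like $\frac{1}{n}$; at $n = 1000$ it is just under $10^{-3}$.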


First of all, we may notice that $E[A_n]$ does not scale with $n$, so we'll start by explaining that. If we look at this problem as a bipartite graph $G_{n,n}$ with weights $C_{ij} \sim n \cdot \text{Exp}$, then for any matching $M$, we can define its cost

$$\text{cost}(M) = \frac{1}{n} \sum_{i=1}^n C_{i,M(i)}.$$

For any deterministic matching $M$, the expected value of the cost is $\frac{1}{n} \cdot n \cdot n = n$, but we're saying that the expected cost of the best matching still stays constant. To understand this, one heuristic is to think about the edges incident to a single vertex $i$ on the left side, whose weights are $n$ independent random variables $C_1, \dots, C_n$. By definition, for all $t \ge 0$, we have

$$P(C_j > t) = e^{-t/n} \implies P\left(\min_{1 \le j \le n} C_j > t\right) = \left(e^{-t/n}\right)^n = e^{-t},$$

which is the tail probability of a standard exponential variable. Let $C_{(1)}$ be the smallest of the $C_j$s, $C_{(2)}$ be the second smallest, and so on (these are all still connected to a fixed vertex $i$). Since the exponential distribution is memoryless (that is, $P(C > s + t \mid C > s) = P(C > t)$), we can condition on the value of $C_{(1)} = \min_{1 \le j \le n} C_j$. Then we have $n - 1$ weights left, and we know they're all exponential and larger than $C_{(1)}$, so the memoryless property tells us that

$$C_{(2)} \overset{d}{=} C_{(1)} + \frac{n}{n-1} \cdot \text{Exp}$$

(here we use the fact that the minimum of $k$ iid $n \cdot \text{Exp}$ random variables is distributed as $\frac{n}{k} \cdot \text{Exp}$). Repeating this process, if $(C_{(1)}, \dots, C_{(n)})$ is the sorted list of weights attached to a vertex on the left, we have

$$(C_{(1)}, \dots, C_{(n)}) \overset{d}{=} \left(E_1,\ E_1 + \frac{n}{n-1} E_2,\ \dots,\ E_1 + \frac{n}{n-1} E_2 + \dots + \frac{n}{1} E_n\right),$$

where the $E_j$s are iid exponential random variables. As $n \to \infty$, this (heuristically) converges to $(E_1, E_1 + E_2, E_1 + E_2 + E_3, \dots)$, which is a collection of points on the real line that can be described by a Poisson point process of rate 1 (in which the distance between consecutive points is given by an exponential random variable, and the number of points within any interval $[a, b]$ is given by a Poisson distribution). So in summary, if we look at the edges incident to a specific vertex $i$, the first 100 smallest edges will be of constant order as $n$ gets large, so it's pretty reasonable to assume constant-order weight for each $C_{i,M(i)}$ in our matching.
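The representation of the sorted weights as partial sums of scaled exponentials gives their means in closed form: $E[C_{(k)}] = \sum_{j=0}^{k-1} \frac{n}{n-j}$. The sketch below tabulates these means and shows that the first few approach $1, 2, 3, \dots$ (the means of the rate-1 Poisson points) as $n$ grows; the function name is hypothetical.

```python
def expected_sorted_weights(n, how_many):
    """Return [E[C_(1)], ..., E[C_(how_many)]] for n iid weights with law n * Exp(1)."""
    means = []
    running = 0.0
    for j in range(how_many):
        running += n / (n - j)   # the (j+1)-th gap is (n / (n - j)) times a standard exponential
        means.append(running)
    return means

small_n = expected_sorted_weights(10, 3)       # [1.0, 1 + 10/9, 1 + 10/9 + 10/8]
large_n = expected_sorted_weights(10**6, 5)    # very close to [1, 2, 3, 4, 5]
```

For fixed $k$, the difference $E[C_{(k)}] - k = \sum_{j<k} \frac{j}{n-j}$ is of order $\frac{k^2}{n}$, which matches the heuristic that the lightest edges at a vertex have constant-order weights.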

This is just a heuristic, though, and what we've said doesn't really imply that the limit of $E[A_n]$ should go to a constant. So we have to be more specific:

Definition 228
The Poisson-weighted infinite tree is constructed as follows: let $\Gamma$ be the random collection of points $(E_1, E_1 + E_2, E_1 + E_2 + E_3, \dots)$, where the $E_j$ are iid exponential. In our tree, each vertex has infinitely many children indexed by $\mathbb{N}$, and the weights of the connecting edges are sampled as an independent copy of $\Gamma$ for each vertex. (So the first edge from each vertex is the lightest and given by $E_1$, the next edge is given by $E_1 + E_2$, and so on.)

We define a matching $M$ on this infinite tree in the same way as on our graph: it's a subset of the edges such that every vertex is covered exactly once. It turns out the graph $G_{n,n}$ in our problem converges locally weakly to the Poisson-weighted infinite tree $T$. We did the calculation for depth 1 above, and there's a lot of independence after that. So the main idea of Aldous' proof is to consider the random bipartite graph $(G_{n,n}, M^*)$, where $M^*$ is the best (lowest-cost) matching for $G_{n,n}$. Since $G_{n,n}$ converges locally weakly to $T$, and $M^*$ is just a collection of 1s and 0s (telling us whether each of the edges is included), it seems reasonable to expect that we have the local weak convergence

$$(G_{n,n}, M^*) \xrightarrow{\text{lwc}} (T, M_T^*).$$

Trees are easier to analyze than complete bipartite graphs, so that motivates using the infinite tree to attack this problem! One way to get a matching on the tree $T$ is to always pick the lightest edge from the root, and then for all children that aren't connected, connect them to their lightest child, and so on. Because the lightest edge has expectation 1, the average cost of this matching must be 1, which is less than $\frac{\pi^2}{6}$. So something has gone wrong: the point is that $M^*$ has to arise from a local weak limit process, which is not true here! In particular, any local weak limit should be spatially invariant, because the choice of root of our tree should not play any special role in whether an edge belongs in the matching by the definition of local weak convergence. We'll use the next result without proof:


Theorem 229 (Aldous)
If $W_{\varnothing, M(\varnothing)}$ denotes the weight of the edge from the root $\varnothing$ to $M(\varnothing)$, then

$$\lim_{n \to \infty} E[A_n] = c^* = \inf_{\text{spatially invariant } M} E[W_{\varnothing, M(\varnothing)}].$$

In words, consider any spatially invariant matching $M$ of our infinite tree. Because all edges are equivalent, the average cost can be defined as the expected cost of the edge next to the root! One direction of this theorem, showing that $\liminf E[A_n] \ge c^*$, is more straightforward, because we can use a local weak limit argument (which is essentially compactness). Specifically, we can show that a subsequence of $(G_{n,n}, M^*)$ converges locally weakly to $(T, M)$ for some spatially invariant matching $M$, and then the average cost for $M$ must be at least $c^*$ because $c^*$ is optimal. But the other direction is more difficult because we have to produce a good approximation for the ideal $c^*$ on a finite graph, and we won't go through it here.

Instead, in the rest of this lecture, we'll show that $c^* = \frac{\pi^2}{6}$. Here's a heuristic that the paper also starts off with (we're going to subtract infinity from itself, so there will be some issues with the rigor). Let the cost of the Poisson-weighted infinite tree $T$ be the minimum cost of a matching on $T$ (obviously every matching has infinite weight, but ignore that for now). Then we can write down a recursion based on what happens at the first level:

$$\text{cost}(T) = \min_{j \ge 1} \left( W_{\varnothing, j} + \sum_{i \ne j} \text{cost}(T^{(i)}) + \text{cost}(T^{(j)} \setminus j) \right).$$

Basically, we pay the weight of the edge $(\varnothing, j)$ connected to the root, plus the cost from subtrees $T^{(i)}$ at level 1 for all $i \ne j$, and then we can't use vertex $j$ itself for $T^{(j)}$. If we compare this to the cost of the tree without the root vertex, we get a lot of cancellation because the cost of $(T \setminus \varnothing)$ is just the sum of the costs of the $T^{(i)}$:

$$\text{cost}(T) - \text{cost}(T \setminus \varnothing) = \min_{j \ge 1} \left( W_{\varnothing, j} - \left( \text{cost}(T^{(j)}) - \text{cost}(T^{(j)} \setminus j) \right) \right).$$

Defining $X_\varnothing = \text{cost}(T) - \text{cost}(T \setminus \varnothing)$ and $X_j = \text{cost}(T^{(j)}) - \text{cost}(T^{(j)} \setminus j)$, we get

$$\boxed{X_\varnothing = \min_{j \ge 1} \left( W_{\varnothing, j} - X_j \right).}$$

But if we have a spatially invariant matching, $X_\varnothing$ and $X_j$ should agree in distribution (because they are both defined to be the difference between the cost of a full tree and the tree without a particular vertex).

Lemma 230
Let $0 < \zeta_1 < \zeta_2 < \cdots$ be a Poisson point process (with $(\zeta_1, \zeta_2, \dots) \overset{d}{=} (E_1, E_1 + E_2, \dots)$), and let $X, X_i$ be iid from some probability measure $\mu$ on $\mathbb{R}$. Then we have $X \overset{d}{=} \min_{i \ge 1} (\zeta_i - X_i)$ if and only if $\mu$ is the logistic distribution with density $f(x) = \left(e^{x/2} + e^{-x/2}\right)^{-2}$ and cdf $F(x) = \frac{1}{1 + e^{-x}}$.

This lemma is proved by calculus, and once we know the distribution function of $\mu$, we can calculate useful properties directly:

Lemma 231
Let $X, X_i$ be iid from the logistic distribution $\mu$, and define $h(x) = P(X_1 + X_2 > x)$ for all $x \ge 0$. Then

$$\int_0^\infty h(x) \, dx = 1, \qquad \int_0^\infty x \, h(x) \, dx = \frac{\pi^2}{6}.$$
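Both integrals in Lemma 231 can be checked numerically. With $S = X_1 + X_2$, the tail formulas $\int_0^\infty P(S > x) \, dx = E[S^+]$ and $\int_0^\infty x P(S > x) \, dx = \frac{1}{2} E[(S^+)^2]$ turn them into expectations, which the sketch below approximates with a midpoint grid over the pair $(X_1, X_2)$. The grid parameters and function names are our own choices, and this is only a numerical sanity check, not a proof.

```python
import math

def logistic_density(x):
    """Density f(x) = (e^{x/2} + e^{-x/2})^{-2} of the logistic distribution."""
    return 1.0 / (math.exp(x / 2) + math.exp(-x / 2)) ** 2

def h_moments(step=0.1, cutoff=30.0):
    """Approximate (int_0^inf h(x) dx, int_0^inf x h(x) dx) for h(x) = P(X_1 + X_2 > x)."""
    count = int(round(2 * cutoff / step))
    points = [-cutoff + (i + 0.5) * step for i in range(count)]
    weights = [logistic_density(x) * step for x in points]
    first = 0.0    # approximates E[(X_1 + X_2)^+]
    second = 0.0   # approximates E[((X_1 + X_2)^+)^2] / 2
    for xi, wi in zip(points, weights):
        for xj, wj in zip(points, weights):
            s = xi + xj
            if s > 0:
                first += wi * wj * s
                second += wi * wj * s * s / 2
    return first, second

first_integral, second_integral = h_moments()
```

The two outputs should be close to $1$ and $\frac{\pi^2}{6} \approx 1.645$ respectively. The second value can also be seen without any computation: by symmetry $E[(S^+)^2] = \frac{1}{2} E[S^2]$, and the logistic variance is $\frac{\pi^2}{3}$, so $\frac{1}{2} E[(S^+)^2] = \frac{1}{4} \cdot \frac{2\pi^2}{3} = \frac{\pi^2}{6}$.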


To understand how this relates to the matching problem, we can go back to the boxed equation above. We
can guess that in the optimal matching for our tree, the weight of the edge adjacent to the root has distribution
$W_{\varnothing, M^*(\varnothing)} \overset{d}{=} \zeta_{i^*}$, where $i^* = \arg\min_{i \ge 1} (\zeta_i - X_i)$. So we will calculate $\mathbb{E}[\zeta_{i^*}]$, and hopefully that's the number that we
want. There are various ways to do this, but we'll follow what the paper does. Consider the process $(\zeta_i, X_i)$, which is
a sequence of points scattered on the right half-plane; call the $\zeta$-axis the $z$-axis and the $X$-axis the $x$-axis. Because
the $\zeta_i$s form a Poisson point process, the number of points in any $z$-interval $[a, b]$ is a Poisson random variable with
mean $b - a$. This means that within any region of the $zx$-plane, the number of points inside this region is also Poisson.
Specifically, $(\zeta_i, X_i)$ is a Poisson point process on $(0, \infty) \times \mathbb{R}$ with intensity measure $\mathrm{Leb} \otimes \mu$, where $\mu$ is logistic (that
is, the $X_i$s are iid logistic) and
$$\#\{\text{points in } R\} \sim \mathrm{Pois}\bigl((\mathrm{Leb} \otimes \mu)(R)\bigr)$$
for any region $R$. In particular, for any two disjoint regions $R$ and $S$, behavior inside those regions is independent, so if
we condition on the event $A_y = \{\text{some point } j \text{ of } (\zeta_i, X_i) \text{ lands on the vertical line } z = y\}$ (we will actually condition
on finding a point where $z \in [y, y + dy]$), then the rest of the process is distributed as an independent Poisson process
(and $z = y$ is disjoint from the other regions). This allows us to compute the distribution of $\zeta_{i^*}$, where $i^* = \arg\min_{i \ge 1} (\zeta_i - X_i)$.
We claim that 

$$\mathbb{P}(i^* = j \mid A_y) = \mathbb{P}\Bigl( y - X_j < \min_{i \ge 1} (\tilde\zeta_i - \tilde X_i) \Bigr),$$

where the $j$ is the same one as in the event $A_y$ (which we know the value of because we're conditioning on $A_y$) and
the $\tilde\zeta_i$s and $\tilde X_i$s are sampled independently of $X_j$. (In words, this is calculating the probability that if we find a
point at horizontal coordinate $y$, it has the smallest value of $\zeta_i - X_i$ among all points in the process.) To explain this
equation, notice that $i^* = j$ means that $\zeta_j - X_j$ is the minimal value among all $\zeta_i - X_i$s, but we condition on $\zeta_j = y$
and then the remaining $(\zeta_i, X_i)$ pairs are all independent of $X_j$, so we can reindex them as an independent Poisson point
process $(\tilde\zeta_i, \tilde X_i)$. Write $\tilde X = \min_{i \ge 1} (\tilde\zeta_i - \tilde X_i)$; by Lemma 230, because the $\tilde X_i$ are logistic, so is $\tilde X$. Putting this all together
and remembering that $X_j$ and $\tilde X$ are independent, we find that
$$\mathbb{P}(i^* = j \mid A_y) = \mathbb{P}(y - X_j < \tilde X) = \mathbb{P}(X_j + \tilde X > y) = h(y)$$


by Lemma 231. So now we can calculate the unconditioned probability density function for $\zeta_{i^*}$ (which is what we're
after): to have $\zeta_{i^*} \in [y, y + dy]$, we must have the event $A_{[y, y + dy]}$ occur, so

$$\mathbb{P}(\zeta_{i^*} \in [y, y + dy]) = \mathbb{P}(A_{[y, y + dy]}) \, \mathbb{P}(i^* = j \mid A_{[y, y + dy]}) = h(y) \, dy$$

(because the probability of having a point of the Poisson point process in a vertical strip of width $dy$ is $dy$). In
other words, $\zeta_{i^*} \overset{d}{=} W_{\varnothing, M^*(\varnothing)}$ has density $h$, so the expected value of $W_{\varnothing, M^*(\varnothing)}$ on our infinite tree is just the expected
value $\int_0^\infty x \, h(x) \, dx = \frac{\pi^2}{6}$.
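
The claim that $\zeta_{i^*}$ has mean $\pi^2/6$ can also be checked by simulation. The following is a minimal Monte Carlo sketch, not anything from the paper: it assumes we may truncate the Poisson process at a cutoff `z = 40` (points beyond that essentially never attain the minimum), and the function names and parameters are hypothetical.

```python
import math
import random


def sample_logistic(rng):
    """Inverse-CDF sample from the logistic law with F(x) = 1 / (1 + e^{-x})."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return math.log(u / (1.0 - u))


def sample_zeta_star(rng, cutoff=40.0):
    """Return zeta_{i*}, where i* minimizes zeta_i - X_i over points with zeta_i <= cutoff.

    The zeta_i are arrival times of a rate-1 Poisson process (cumulative Exp(1) gaps).
    The first point is always used, so the result is well defined.
    """
    zeta = rng.expovariate(1.0)
    best_value, best_zeta = zeta - sample_logistic(rng), zeta
    while True:
        zeta += rng.expovariate(1.0)
        if zeta > cutoff:
            return best_zeta
        value = zeta - sample_logistic(rng)
        if value < best_value:
            best_value, best_zeta = value, zeta


def estimate_mean(num_trials, seed=0):
    """Monte Carlo estimate of E[zeta_{i*}]."""
    rng = random.Random(seed)
    return sum(sample_zeta_star(rng) for _ in range(num_trials)) / num_trials


if __name__ == "__main__":
    print(estimate_mean(20000), "vs pi^2/6 =", math.pi ** 2 / 6)
```

The estimate fluctuates with the seed and the number of trials, so it should only be expected to agree with $\pi^2/6 \approx 1.645$ up to Monte Carlo error.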

To conclude, we'll briefly discuss how to make the rest of the proof actually work. Our random variable $i^*$ is related
to the matching on our infinite tree $T$, but we still need to construct the actual matching $M^*$. (Morally, we want to
sample the infinite tree and then solve the boxed recursion above for the $X_j$s, but there are infinitely many equations
and variables). So instead, the idea is to only look at the edge-weights of $T$ up to some finite level $n$ and solve for $X_j$s
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within that, and we can avoid issues with dependence by using directed edges instead of undirected edges. Instead 
of the boxed recursion above, we now require the modified statement that 


$$X_{v \to w} = \min_{u \ne v, \; u \sim w} \bigl( W_{w,u} - X_{w \to u} \bigr).$$


So we construct $(T, X)$ (the infinite tree and the set of $X$ random variables) as follows. First of all, we sample the
edge-weights of $T$ up to some finite level $n$, and we let $W_{v,w}$ be the weight along the edge connecting $v$ with $w$ (in
both directions). We can then determine the directed edge weights $X_{v \to w}$ and $X_{w \to v}$ within level $n$ as follows. At all
vertices of level $n$, sample iid logistic random variables pointing to their children. Then applying the recursion at each
level $n$ vertex allows us to find the directed edge weights from level $(n - 1)$ to $n$, and repeatedly doing this allows us
to determine $X_{v \to w}$ for all edges pointing away from the root. After that, we can also determine the edge weights
pointing towards the root by continuing to apply the recursion relation. Finally, we can apply the Kolmogorov extension
theorem to get the infinite construction $(T, X)$.

But having $X$ in addition to $T$ now allows us to construct the matching on $T$. Indeed, the edge between $i$ and $j$ is
in the matching if

$$j = \arg\min_{k \sim i} \bigl( W_{i,k} - X_{i \to k} \bigr).$$


This gives us a unique neighbor for each vertex, and we can check that $i$'s neighbor is chosen to be $j$ if and only if $j$'s
neighbor is chosen to be $i$, because if $j$ is chosen to be $i$'s neighbor, then
$$W_{i,j} - X_{i \to j} < \min_{k \sim i, \; k \ne j} \bigl( W_{i,k} - X_{i \to k} \bigr) = X_{j \to i}$$
by our modified recursion relation. In other words, an edge is included in the matching if the two $X$ variables associated
to it add to more than the weight $W_{i,j}$, and we can check that this will not hold for any other neighbor of the same vertex! This thus gives
us a construction with the desired weights (and thus a spatially invariant matching with expected weight $\frac{\pi^2}{6}$, which we
can plug into Theorem 229), and we can finish the proof by showing that any deviation from this matching cannot
satisfy the recursion relation. (But we can read the paper for more details!)
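
To make the finite-level construction concrete, here is a small sketch of the two message-passing sweeps and the matching rule. It makes several simplifying assumptions that are not in the notes: each vertex keeps only its `branching` lightest child edges (the true tree has infinitely many children), the messages pointing into the level-`depth` vertices are sampled directly as iid logistic variables, and all names (`build_messages`, `partner`, `matching_is_consistent`) are hypothetical. Vertices are tuples of child indices, with the root as the empty tuple.

```python
import math
import random
from itertools import product


def sample_logistic(rng):
    """Inverse-CDF sample from the logistic law."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return math.log(u / (1.0 - u))


def build_messages(depth, branching, rng):
    """Sample weights and solve X_{v->w} = min_{u~w, u!=v} (W_{w,u} - X_{w->u}).

    weight[w]  = W_{parent(w), w}; out_msg[w] = X_{parent(w) -> w};
    in_msg[w]  = X_{w -> parent(w)}. Requires branching >= 2.
    """
    levels, weight = [[()]], {}
    for _ in range(depth):
        nxt = []
        for v in levels[-1]:
            zeta = 0.0
            for k in range(branching):  # first few points of a rate-1 Poisson process
                zeta += rng.expovariate(1.0)
                weight[v + (k,)] = zeta
                nxt.append(v + (k,))
        levels.append(nxt)
    # Boundary condition (assumption): iid logistic messages into the last level.
    out_msg = {w: sample_logistic(rng) for w in levels[depth]}
    for lvl in range(depth - 1, 0, -1):  # sweep up: messages pointing away from the root
        for w in levels[lvl]:
            out_msg[w] = min(weight[w + (k,)] - out_msg[w + (k,)] for k in range(branching))
    in_msg = {}
    for lvl in range(depth):  # sweep down: messages pointing towards the root
        for v in levels[lvl]:
            for k in range(branching):
                terms = [weight[v + (m,)] - out_msg[v + (m,)] for m in range(branching) if m != k]
                if v:  # a non-root vertex also has its parent as a neighbor
                    terms.append(weight[v] - in_msg[v])
                in_msg[v + (k,)] = min(terms)
    return weight, out_msg, in_msg


def partner(v, weight, out_msg, in_msg, branching):
    """Neighbor k of v minimizing W_{v,k} - X_{v->k}; v must lie above the last level."""
    cands = [(weight[v + (k,)] - out_msg[v + (k,)], v + (k,)) for k in range(branching)]
    if v:
        cands.append((weight[v] - in_msg[v], v[:-1]))
    return min(cands)[1]


def matching_is_consistent(depth, branching, rng):
    """Check that i picks j iff j picks i, for pairs strictly above the last level."""
    weight, out_msg, in_msg = build_messages(depth, branching, rng)
    internal = [v for lvl in range(depth) for v in product(range(branching), repeat=lvl)]
    for i in internal:
        j = partner(i, weight, out_msg, in_msg, branching)
        if len(j) < depth and partner(j, weight, out_msg, in_msg, branching) != i:
            return False
    return True
```

For example, `matching_is_consistent(4, 3, random.Random(0))` runs the check on a tree of depth 4 with three children per vertex. Vertices on the last level are excluded because their incoming messages are sampled rather than computed from the recursion, so the symmetry argument does not apply to them.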
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